Performance analysis and improvement of InfiniBand networks. Modelling and effective Quality-of-Service mechanisms for interconnection networks in cluster computing systems.
The InfiniBand Architecture (IBA) network has been proposed as a new
industrial standard with high bandwidth and low latency, suitable for constructing
high-performance interconnected cluster computing systems. This architecture
replaces the traditional bus-based interconnection with a switch-based network for
the server Input-Output (I/O) and inter-processor communications. An efficient
Quality-of-Service (QoS) mechanism is fundamental to ensuring the important QoS
metrics, such as maximum throughput and minimum latency, leaving aside other
aspects such as guarantees to reduce the delay, blocking probability, and mean
queue length.
Performance modelling and analysis has been and continues to be of great
theoretical and practical importance in the design and development of
communication networks. This thesis aims to investigate efficient and cost-effective
QoS mechanisms for performance analysis and improvement of InfiniBand
networks in cluster-based computing systems.
Firstly, a rate-based source-response link-by-link admission and congestion
control function with an improved Explicit Congestion Notification (ECN) packet
marking scheme is developed. This function adopts rate control to reduce
congestion of multiple-class traffic. Secondly, a credit-based flow control scheme is
presented to reduce the mean queue length, throughput and response time of the
system. In order to evaluate the performance of this scheme, a new queueing
network model is developed. Theoretical analysis and simulation experiments show
that these two schemes are quite effective and suitable for InfiniBand networks.
Finally, to obtain a thorough and deep understanding of the performance attributes
of the InfiniBand Architecture network, two efficient threshold function flow control
mechanisms are proposed to enhance the QoS of InfiniBand networks; one is Entry
Threshold, which sets the threshold for each entry in the arbitration table, and the other is
Arrival Job Threshold that sets the threshold based on the number of jobs in each
Virtual Lane. Furthermore, the principle of Maximum Entropy is adopted to analyse
these two new mechanisms with the Generalized Exponential (GE)-Type
distribution for modelling the inter-arrival times and service times of the input traffic.
Extensive simulation experiments are conducted to validate the accuracy of the
analytical models.
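The GE-type distribution named above is commonly written as F(t) = 1 - tau * exp(-tau * nu * t) for t >= 0, with tau = 2 / (C^2 + 1), where 1/nu is the mean and C^2 is the squared coefficient of variation. The abstract does not state its parameterization, so the sketch below assumes this standard form and simply draws samples from it; the function names `sample_ge` and `sample_stats` are hypothetical and are not taken from the thesis.

```python
import random


def sample_ge(mean, scv, rng):
    """Draw one sample from a GE-type distribution.

    Assumes the form F(t) = 1 - tau * exp(-tau * t / mean) with
    tau = 2 / (scv + 1). This mixture view requires scv >= 1.
    """
    if scv < 1:
        raise ValueError("this sampling sketch needs scv >= 1")
    tau = 2.0 / (scv + 1.0)
    # With probability 1 - tau the interval is zero (a batch arrival).
    if rng.random() >= tau:
        return 0.0
    # Otherwise the interval is exponential with rate tau / mean.
    return rng.expovariate(tau / mean)


def sample_stats(samples):
    """Return (mean, squared coefficient of variation) of the samples."""
    n = len(samples)
    m = sum(samples) / n
    var = sum((x - m) ** 2 for x in samples) / n
    return m, var / (m * m)


if __name__ == "__main__":
    rng = random.Random(12345)
    xs = [sample_ge(mean=2.0, scv=4.0, rng=rng) for _ in range(200_000)]
    print(sample_stats(xs))  # should be close to (2.0, 4.0)
```

The zero-length intervals are what make the GE form convenient for bursty traffic: a fraction 1 - tau of arrivals come in batches, which is how a single two-parameter distribution can match both a mean and a variability target.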
Management, Optimization and Evolution of the LHCb Online Network
The LHCb experiment is one of the four large particle detectors running at the
Large Hadron Collider (LHC) at CERN. It is a forward single-arm spectrometer dedicated to testing the Standard Model through precision measurements of
Charge-Parity (CP) violation and rare decays in the b quark sector. The LHCb
experiment will operate at a luminosity of 2x10^32 cm^-2 s^-1, and the proton-proton
bunch crossing rate will be approximately 10 MHz. To select the interesting
events, a two-level trigger scheme is applied: the first level trigger (L0) and the
high level trigger (HLT). The L0 trigger is implemented in custom hardware,
while the HLT is implemented in software that runs on the CPUs of the Event Filter
Farm (EFF). The L0 trigger rate is defined at about 1 MHz, and the event size
for each event is about 35 kByte. It is a serious challenge to handle the resulting
data rate (35 GByte/s).
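The quoted data rate follows directly from the L0 rate and the event size. A small Python check of that arithmetic, assuming 1 kByte = 1000 bytes and 1 GByte = 10^9 bytes (the abstract does not say which convention it uses; the function name is hypothetical):

```python
def data_rate_bytes_per_s(trigger_rate_hz, event_size_bytes):
    """Aggregate readout bandwidth: accepted events per second times bytes per event."""
    return trigger_rate_hz * event_size_bytes


L0_RATE_HZ = 1_000_000      # about 1 MHz L0 rate, from the text
EVENT_SIZE_BYTES = 35_000   # about 35 kByte per event, assuming 1 kByte = 1000 bytes

rate = data_rate_bytes_per_s(L0_RATE_HZ, EVENT_SIZE_BYTES)
print(f"{rate / 1e9:.0f} GByte/s")     # prints "35 GByte/s"
print(f"{rate * 8 / 1e9:.0f} Gbit/s")  # prints "280 Gbit/s"
```

Expressed in bits, the same figure is 280 Gbit/s, which gives a sense of why the load has to be spread over many Gigabit Ethernet links.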
The Online system is a key part of the LHCb experiment, providing all the
IT services. It consists of three major components: the Data Acquisition (DAQ)
system, the Timing and Fast Control (TFC) system and the Experiment Control
System (ECS). To provide the services, two large dedicated networks based on
Gigabit Ethernet are deployed: one for DAQ and another one for ECS, which are
together referred to as the Online network. A large network needs sophisticated monitoring for its successful operation. Commercial network management systems are
quite expensive and difficult to integrate into the LHCb ECS. A custom network
monitoring system has been implemented based on a Supervisory Control And
Data Acquisition (SCADA) system called PVSS which is used by LHCb ECS. It
is a homogeneous part of the LHCb ECS. In this thesis, it is demonstrated how
a large scale network can be monitored and managed using tools originally made
for industrial supervisory control.
The thesis is organized as follows:
Chapter 1 gives a brief introduction to the LHC and B physics at the LHC,
then describes all sub-detectors and the trigger and DAQ system of LHCb from
structure to performance.
Chapter 2 first introduces the LHCb Online system and the dataflow, then
focuses on the Online network design and its optimization.
In Chapter 3, the SCADA system PVSS is introduced briefly, then the
architecture and implementation of the network monitoring system are described
in detail, including the front-end processes, the data communication and the
supervisory layer.
Chapter 4 first discusses the packet sampling theory and one of the packet
sampling mechanisms: sFlow, then demonstrates the applications of sFlow for
the network trouble-shooting, the traffic monitoring and the anomaly detection.
In Chapter 5, the upgrade of LHC and LHCb is introduced, the possible
architecture of DAQ is discussed, and two candidate internetworking technologies (high speed Ethernet and InfiniBand) are compared in different aspects for
DAQ. Three schemes based on 10 Gigabit Ethernet are presented and studied.
Chapter 6 is a general summary of the thesis.
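The packet sampling idea behind Chapter 4 can be illustrated with a short sketch: if roughly one packet in N is sampled, the total is estimated by scaling the sample count by N, and the relative error shrinks with the square root of the number of samples. The code below is a generic illustration, not the thesis's implementation; the function names are hypothetical, and the 1.96 * sqrt(1/c) bound is the commonly quoted 95% approximation for packet sampling, used here as an assumption.

```python
import math
import random


def estimate_total(sampled_count, sampling_rate_n):
    """Scale a sample count up to an estimated total, for 1-in-N sampling."""
    return sampled_count * sampling_rate_n


def relative_error_95(sampled_count):
    """Approximate 95% relative error bound, 1.96 * sqrt(1 / c), for c samples."""
    return 1.96 * math.sqrt(1.0 / sampled_count)


def simulate(total_packets, sampling_rate_n, seed=0):
    """Sample each packet independently with probability 1/N and count the hits."""
    rng = random.Random(seed)
    p = 1.0 / sampling_rate_n
    return sum(1 for _ in range(total_packets) if rng.random() < p)


if __name__ == "__main__":
    n = 1000
    c = simulate(total_packets=1_000_000, sampling_rate_n=n, seed=1)
    print("sampled packets:", c)
    print("estimated total:", estimate_total(c, n))
    print("approx. 95% relative error:", relative_error_95(c))
```

The error bound depends only on how many samples fall into the class of interest, not on the sampling rate itself, which is why coarse sampling still works well for high-volume flows.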
Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks
In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which are either irregular by design or, most often, a result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables.
Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
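For readers unfamiliar with the channel dependency graph (CDG), the sketch below shows the classical check it supports: channels are the directed links of the network, a dependency edge connects two channels that some route uses back to back, and a set of routes is deadlock-free if the resulting graph has no cycle. This is only the textbook acyclicity test on a single virtual channel, written under those assumptions; it is not the routing algorithm proposed in the thesis, and the function names are hypothetical.

```python
def channel_dependency_graph(routes):
    """Build the CDG from routes given as node sequences.

    Channels are directed links (u, v). An edge c1 -> c2 means some
    route uses channel c2 immediately after channel c1.
    """
    cdg = {}
    for path in routes:
        channels = list(zip(path, path[1:]))
        for ch in channels:
            cdg.setdefault(ch, set())
        for c1, c2 in zip(channels, channels[1:]):
            cdg[c1].add(c2)
    return cdg


def has_cycle(graph):
    """Iterative three-colour depth-first search for a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}
    for start in graph:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next(successors, None)
            if nxt is None:
                colour[node] = BLACK  # all successors explored
                stack.pop()
            elif colour[nxt] == GREY:
                return True           # back edge: dependency cycle
            elif colour[nxt] == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(graph[nxt])))
    return False


# Four-node ring with every route going clockwise: the dependencies close a cycle.
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
# Same ring, but the last route goes counter-clockwise: the cycle is broken.
fixed_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 2, 1]]

print(has_cycle(channel_dependency_graph(ring_routes)))   # True
print(has_cycle(channel_dependency_graph(fixed_routes)))  # False
```

In practice, cycles that cannot be avoided by path choice are broken by assigning routes to different virtual channels, which is the resource the abstract says must stay within a fixed budget.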
Hybrid High Performance Computing (HPC) + Cloud for Scientific Computing
The HPC+Cloud framework has been built to enable on-premise HPC jobs to use resources from cloud computing nodes. As part of designing the software framework, the public cloud providers Amazon AWS, Microsoft Azure and NeCTAR were benchmarked against one another, and Microsoft Azure was determined to be the most suitable cloud component for the proposed HPC+Cloud software framework. Finally, an HPC+Cloud cluster was built using the HPC+Cloud software framework and was then validated by conducting HPC processing benchmarks.
Exascale Deep Learning for Climate Analytics
We extract pixel-level masks of extreme weather patterns using variants of
Tiramisu and DeepLabv3+ neural networks. We describe improvements to the
software frameworks, input pipeline, and the network training algorithms
necessary to efficiently scale deep learning on the Piz Daint and Summit
systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained
throughput of 21.0 PF/s and parallel efficiency of 79.0%. DeepLabv3+ scales up
to 27360 V100 GPUs with a sustained throughput of 325.8 PF/s and a parallel
efficiency of 90.7% in single precision. By taking advantage of the FP16 Tensor
Cores, a half-precision version of the DeepLabv3+ network achieves a peak and
sustained throughput of 1.13 EF/s and 999.0 PF/s respectively.
Comment: 12 pages, 5 tables, 4 figures, Super Computing Conference, November
11-16, 2018, Dallas, TX, US