Traffic generation for benchmarking data centre networks
Benchmarking is commonly used in research fields, such as computer architecture design and machine learning, as a powerful paradigm for rigorously assessing, comparing, and developing novel technologies. However, the data centre network (DCN) community lacks a standard open-access and reproducible traffic generation framework for benchmark workload generation. Driving factors behind this include the proprietary nature of traffic traces, the limited detail and quantity of open-access network-level data sets, the high cost of real-world experimentation, and the poor reproducibility and fidelity of synthetically generated traffic. This is curtailing the community's understanding of existing systems and hindering the development, comparison, and testing of novel technologies such as optical DCNs. We present TrafPy, an open-access framework for generating both realistic and custom DCN traffic traces. TrafPy is compatible with any simulation, emulation, or experimentation environment, and can be used for standardised benchmarking and for investigating the properties and limitations of network systems such as schedulers, switches, routers, and resource managers. We give an overview of the TrafPy traffic generation framework, and provide a brief demonstration of its efficacy through an investigation into the sensitivity of some canonical scheduling algorithms to varying traffic trace characteristics in the context of optical DCNs. TrafPy is open-sourced via GitHub, and all data associated with this manuscript are available via RDR.
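The general idea of distribution-driven trace generation can be sketched in a few lines; the distribution choices and parameters below (Poisson arrivals, lognormal sizes, node count) are illustrative assumptions, not TrafPy's actual API or defaults:

```python
import random

random.seed(0)

# Assumed parameters for illustration only
NUM_FLOWS, NUM_NODES, ARRIVAL_RATE = 1000, 32, 50.0

t, trace = 0.0, []
for fid in range(NUM_FLOWS):
    t += random.expovariate(ARRIVAL_RATE)        # Poisson inter-arrival times
    size = int(random.lognormvariate(7, 2)) + 1  # heavy-tailed flow size (bytes)
    src, dst = random.sample(range(NUM_NODES), 2)  # distinct endpoints
    trace.append({"id": fid, "time": t, "src": src, "dst": dst, "size": size})

print(len(trace), trace[0], trace[-1]["time"])
```

Swapping in empirically fitted distributions for sizes, inter-arrival times, and endpoint locality is what turns a sketch like this into a realistic workload generator.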
Progressive load balancing of asynchronous algorithms
Massively parallel supercomputers are susceptible to variable performance due to
factors such as differences in chip manufacturing, heat management and network congestion. As a result, the same code with the same input can have a different execution
time from run to run. Synchronisation under these circumstances is a key challenge
that prevents applications from scaling to large problems and machines.
Asynchronous algorithms offer a partial solution. In these algorithms fast processes
are not forced to synchronise with slower ones. Instead, they continue computing updates, and moving towards the solution, using the latest data available to them, which
may have become stale (i.e. the data is a number of iterations out of date compared
to the most recent version). While this allows for high computational efficiency, the
convergence rate of asynchronous algorithms tends to be lower than that of synchronous algorithms due to the use of stale values. A large degree of performance variability can
eliminate the performance advantage of asynchronous algorithms or even cause the
results to diverge.
To address this problem, we use the unique properties of asynchronous algorithms
to develop a load balancing strategy for iterative convergent asynchronous algorithms
in both shared and distributed memory. The proposed approach – Progressive Load
Balancing (PLB) – aims to balance progress levels over time, rather than attempting to
equalise iteration rates across parallel workers. This approach attenuates noise without
sacrificing performance, resulting in a significant reduction in progress imbalance and
improving time to solution.
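The distinction between balancing progress and balancing iteration rates can be illustrated with a toy simulation; the worker speeds, the rows-as-work model, and the move-one-row rule below are assumptions for illustration, not the thesis's actual PLB algorithm:

```python
# Four workers iterate over a shared pool of rows; one runs at 60% speed.
ROWS_TOTAL, STEPS = 64, 200
speed = [1.0, 1.0, 1.0, 0.6]
rows = [16, 16, 16, 16]        # work currently owned by each worker
progress = [0.0] * 4           # iterations completed so far

for t in range(STEPS):
    for w in range(4):         # iteration rate falls as owned rows grow
        progress[w] += speed[w] * ROWS_TOTAL / rows[w]
    if t % 5 == 0:             # progressive rebalance: the worker whose
        lag = min(range(4), key=lambda w: progress[w])   # progress lags
        lead = max(range(4), key=lambda w: progress[w])  # sheds work to
        if lag != lead and rows[lag] > 1:                # the leader
            rows[lag] -= 1
            rows[lead] += 1

imbalance = max(progress) - min(progress)
no_plb = STEPS * (1.0 - 0.6) * ROWS_TOTAL / 16  # imbalance with fixed rows
print(rows, round(imbalance, 1), no_plb)
```

Because the laggard sheds work (speeding it up) while the leader absorbs it (slowing it down), progress levels converge over time even though iteration rates are never equalised directly.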
The developed method is evaluated in a variety of scenarios using the asynchronous
Jacobi algorithm. In shared memory, we show that it can essentially eliminate the
negative effects of a single core in a node slowed down by 19%. Work stealing, an
alternative load balancing approach, is shown to be ineffective. In distributed memory,
the method reduces the impact of up to 8 slow nodes out of 15, each slowed down
by 40%, resulting in 1.03×–1.10× reduction in time to solution and 1.11×–2.89×
reduction in runtime variability. Furthermore, we successfully apply the method in
a scenario with real faulty components running 75% slower than normal. Broader
applicability of progressive load balancing is established by emulating its application
to asynchronous stochastic gradient descent where it is found to improve both training
time and the learned model’s accuracy.
Overall, this thesis demonstrates that enhancing asynchronous algorithms with
PLB is an effective method for tackling performance variability in supercomputers.
Application-centric bandwidth allocation in datacenters
Today's datacenters host a large number of concurrently executing applications with diverse intra-datacenter latency and bandwidth requirements.
Some of these applications, such as data analytics, graph processing, and machine learning training, are data-intensive and require high bandwidth to function properly.
However, these bandwidth-hungry applications can often congest the datacenter network, leading to queuing delays that hurt application completion time.
To remove the network as a potential performance bottleneck, datacenter operators have begun deploying high-end HPC-grade networks like InfiniBand.
These networks offer fully offloaded network stacks, remote direct memory access (RDMA) capability, and non-discarding links, which allow them to provide both low latency and high bandwidth for a single application.
However, it is unclear how well such networks accommodate a mix of latency- and bandwidth-sensitive traffic in a real-world deployment.
In this thesis, we aim to answer the above question.
To do so, we develop RPerf, a latency measurement tool for RDMA-based networks that can precisely measure the InfiniBand switch latency without hardware support.
Using RPerf, we benchmark a rack-scale InfiniBand cluster in both isolated and mixed-traffic scenarios.
Our key finding is that the evaluated switch can provide either low latency or high bandwidth, but not both simultaneously in a mixed-traffic scenario.
We also evaluate several options to improve the latency-bandwidth trade-off and demonstrate that none are ideal.
We find that while queue separation is a solution to protect latency-sensitive applications, it fails to properly manage the bandwidth of other applications.
We also aim to resolve the problem of bandwidth management for non-latency-sensitive applications.
Previous efforts to address this problem have generally focused on achieving max-min fairness at the flow level.
However, we observe that different workloads exhibit varying levels of sensitivity to network bandwidth.
For some workloads, even a small reduction in available bandwidth can significantly increase completion time, while for others, completion time is largely insensitive to available network bandwidth.
As a result, simply splitting the bandwidth equally among all workloads is sub-optimal for overall application-level performance.
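Why an equal split can be suboptimal is visible even in a two-job toy model; the job parameters, the additive completion-time model, and the square-root allocation rule below are illustrative assumptions, not Saba's measured policy:

```python
import math

# (compute seconds, data units to transfer): job A is bandwidth-hungry,
# job B barely uses the network. Numbers are made up for illustration.
jobs = {"A": (10.0, 900.0), "B": (10.0, 100.0)}
LINK = 100.0  # total shareable bandwidth (units/s)

def jct(compute, data, bw):
    # toy model: transfer time adds to compute time and scales as 1/bw
    return compute + data / bw

def avg_jct(alloc):
    return sum(jct(*jobs[j], alloc[j]) for j in jobs) / len(jobs)

# flow-level fairness: equal split
equal = {j: LINK / len(jobs) for j in jobs}

# sensitivity-aware split: under the 1/bw model, average JCT is minimised
# by allocating in proportion to sqrt(data) (equate the derivatives of
# the per-job transfer terms)
w = {j: math.sqrt(jobs[j][1]) for j in jobs}
aware = {j: LINK * w[j] / sum(w.values()) for j in jobs}

print(avg_jct(equal), avg_jct(aware))  # 20.0 vs 18.0
```

The equal split leaves the insensitive job with bandwidth it barely uses, while the sensitivity-aware split trades a small slowdown for job B against a large speedup for job A.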
To address this issue, we first propose a robust methodology capable of effectively measuring the sensitivity of applications to bandwidth.
We then design Saba, an application-aware bandwidth allocation framework that distributes network bandwidth based on application-level sensitivity.
Saba combines ahead-of-time application profiling to determine bandwidth sensitivity with runtime bandwidth allocation using lightweight software support, with no modifications to network hardware or protocols.
Experiments with a 32-server hardware testbed show that Saba can significantly increase overall performance by reducing the job completion time for bandwidth-sensitive jobs.