MLTCP: Congestion Control for DNN Training
We present MLTCP, a technique to augment today's congestion control
algorithms to accelerate DNN training jobs in shared GPU clusters. MLTCP
enables the communication phases of jobs that compete for network bandwidth to
interleave with each other, thereby utilizing the network efficiently. At the
heart of MLTCP lies a very simple principle based on a key conceptual insight:
DNN training flows should scale their congestion window size based on the
number of bytes sent at each training iteration. We show that integrating this
principle into today's congestion control protocols is straightforward: by
adding 30-60 lines of code to Reno, CUBIC, or DCQCN, MLTCP stabilizes flows of
different jobs into an interleaved state within a few training iterations,
regardless of the number of competing flows or the start time of each flow. Our
experiments with popular DNN training jobs demonstrate that enabling MLTCP
accelerates the average and 99th percentile training iteration time by up to 2x
and 4x, respectively.
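The core principle above — scaling congestion-window growth by the bytes a flow has sent in the current training iteration — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the hook point, and the scaling rule are assumptions for exposition only.

```python
# Hypothetical sketch of the MLTCP scaling idea: augment a standard
# additive-increase step so that the window grows in proportion to how
# far the flow has progressed through its current training iteration.
# Flows at different points in their iterations then grow at different
# rates, which nudges competing jobs into an interleaved schedule.

def mltcp_cwnd_update(cwnd, bytes_sent_this_iter, iter_total_bytes,
                      base_increase=1.0):
    """Return the new congestion window after one ACK.

    cwnd                 -- current congestion window (segments)
    bytes_sent_this_iter -- bytes this flow has sent so far in the
                            current training iteration
    iter_total_bytes     -- total bytes the iteration will send
    base_increase        -- the protocol's normal per-ACK increase
    """
    # Fraction of the iteration's traffic already sent (0.0 to 1.0).
    progress = bytes_sent_this_iter / iter_total_bytes
    # Standard additive increase, scaled up as the iteration progresses.
    return cwnd + base_increase * (1.0 + progress)
```

For example, a flow halfway through its iteration (`progress = 0.5`) grows its window 1.5x faster than one that has just started, so two jobs that begin in lockstep drift out of phase and stop colliding on the bottleneck link.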
On a single server queue with negative arrivals and request repeated
There is a growing interest in queueing systems with negative arrivals, i.e., systems where the arrival of a negative customer has the effect of deleting some customer in the queue. Recently, Harrison and Pitel (1996) investigated the queue length distribution of a single server queue of type M/G/1 with negative arrivals. In this paper we extend the analysis to the context of queueing systems with repeated requests. We show that the limiting distribution of the system state can still be reduced to a Fredholm integral equation. We solve such an equation numerically by introducing an auxiliary 'truncated' system which can easily be evaluated with the help of a regenerative approach.
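For context, a Fredholm integral equation of the second kind — the class the abstract refers to — has the general form (the specific kernel and free term in the paper depend on the service-time and retrial distributions, which are not given here):

```latex
\varphi(x) \;=\; f(x) + \lambda \int_a^b K(x, y)\, \varphi(y)\, dy,
\qquad x \in [a, b],
```

where \(\varphi\) is the unknown function, \(f\) and the kernel \(K\) are known, and \(\lambda\) is a parameter. Equations of this form generally have no closed-form solution, which is why the authors resort to a numerical scheme built on a truncated auxiliary system.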