Distributed Machine Learning (DML) systems are utilized to enhance the speed
of model training in data centers (DCs) and edge nodes. The Parameter Server
(PS) communication architecture is commonly employed, but it faces severe
long-tail latency caused by many-to-one "incast" traffic patterns, negatively
impacting training throughput. To address this challenge, we design the
\textbf{L}oss-tolerant \textbf{T}ransmission \textbf{P}rotocol (LTP), which
permits partial loss of gradients during synchronization to avoid unneeded
retransmission and contributes to faster synchronization per iteration. LTP
implements loss-tolerant transmission through \textit{out-of-order
transmission} and \textit{out-of-order acknowledgements (ACKs)}. LTP employs
\textit{Early Close} to adjust the loss-tolerant threshold based on network
conditions and \textit{Bubble Filling} for data correction to maintain training
accuracy. LTP is implemented in C++ and integrated into PyTorch. Evaluations on
a testbed of eight worker nodes and one PS node demonstrate that LTP can
significantly improve DML training task throughput by up to 30x compared to
traditional TCP congestion controls, with no sacrifice to final accuracy.

Comment: This paper will be published at IWQoS 2023. Preview version only.
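The core idea of loss-tolerant synchronization with data correction can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the function name `bubble_fill`, the zero-filling of lost gradient chunks, and the rescaling by the received fraction are all assumptions made for exposition; the paper's actual Bubble Filling scheme may differ.

```python
def bubble_fill(received_chunks, num_chunks, chunk_size):
    """Hypothetical sketch of loss-tolerant gradient reassembly.

    received_chunks: dict mapping chunk index -> list of floats
    (chunks whose packets were lost are simply absent). Lost
    chunks are filled with zeros ("bubbles") so the gradient
    keeps a valid shape, then the result is rescaled so the
    expected magnitude is preserved despite the loss.
    """
    grad = [0.0] * (num_chunks * chunk_size)
    received = 0
    for idx, data in received_chunks.items():
        grad[idx * chunk_size:(idx + 1) * chunk_size] = data
        received += 1
    if received:
        # Compensate for the zero-filled bubbles so the aggregate
        # gradient is an unbiased estimate of the full gradient.
        scale = num_chunks / received
        grad = [g * scale for g in grad]
    return grad

# Example: 3 chunks of size 2, chunk 1 lost in transit.
print(bubble_fill({0: [1.0, 1.0], 2: [3.0, 3.0]}, 3, 2))
```

Because the PS never waits for retransmission of the missing chunk, synchronization for the iteration can close early once the loss-tolerant threshold is met.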