research article

Sub-millisecond-level network monitoring system for intelligent computing centers network

Abstract

As intelligent computing centers become core infrastructure supporting the high-quality development of the digital economy, the training of large models with hundreds of billions of parameters imposes stringent requirements on network performance. Traditional monitoring methods struggle to address communication bottlenecks in ten-thousand-card clusters because their sampling precision is insufficient and they lack fine-grained observation. A sub-millisecond-level network monitoring system (sMon) is proposed that integrates intelligent counters and dynamic bandwidth analysis modules into the workflow processing pipeline, enabling real-time tracking of NIC port queue depth and traffic fluctuations. Dynamic bandwidth is calculated with a sliding window algorithm, which maintains sub-millisecond temporal accuracy, while an asynchronous log collection mechanism reduces system overhead to 0.8% CPU utilization. Tests on a 128-node A100 cluster demonstrate that the system captures sub-millisecond traffic details at network ports. Experimental results show a two-order-of-magnitude improvement in data granularity compared with conventional monitoring solutions. Through high-precision network state perception, the proposed system provides real-time monitoring and performance assurance for constructing “ultra-large-scale, ultra-high-bandwidth, ultra-reliable” intelligent computing center networks, effectively supporting the requirements of large-scale AI training tasks.
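The abstract does not give the details of the sliding window calculation, so the following is only a minimal sketch of how a sliding-window bandwidth estimate over cumulative port byte counters might look. The class name `SlidingWindowBandwidth`, the 500 µs default window, and the sample format are illustrative assumptions, not part of sMon.

```python
from collections import deque


class SlidingWindowBandwidth:
    """Hypothetical sliding-window bandwidth estimator (not the sMon code).

    Each sample is a (timestamp in microseconds, cumulative byte counter) pair,
    as might be read from a NIC port counter.
    """

    def __init__(self, window_us: int = 500):
        self.window_us = window_us  # assumed window length, in microseconds
        self.samples = deque()      # (timestamp_us, cumulative_bytes)

    def add_sample(self, timestamp_us: int, cumulative_bytes: int) -> float:
        """Record a sample and return the windowed bandwidth in Gbit/s."""
        self.samples.append((timestamp_us, cumulative_bytes))
        window_start = timestamp_us - self.window_us
        # Keep exactly one sample at or before the window start as the baseline.
        while len(self.samples) >= 2 and self.samples[1][0] <= window_start:
            self.samples.popleft()
        old_t, old_bytes = self.samples[0]
        elapsed_us = timestamp_us - old_t
        if elapsed_us <= 0:
            return 0.0  # not enough data yet
        bits = (cumulative_bytes - old_bytes) * 8
        # bits per microsecond equals Mbit/s; divide by 1000 for Gbit/s.
        return bits / elapsed_us / 1000


# Usage: a steady 100 Gbit/s flow sampled every 100 microseconds.
estimator = SlidingWindowBandwidth(window_us=500)
rates = [estimator.add_sample(t, t * 12_500) for t in range(0, 700, 100)]
```

With a window this short, a burst lasting a few hundred microseconds shows up in the estimate, whereas it would be averaged away by second-level polling.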
