Understanding the cluster-wide I/O patterns of large-scale HPC systems is
essential to minimizing the occurrence and impact of I/O interference. Yet most
previous work in this area has focused on monitoring and predicting task- and
node-level I/O burst events. This paper analyzes Darshan reports from three
supercomputers to extract system-level read and write I/O rates at five-minute
intervals. We observe significant (over 100x) fluctuations in read and write
I/O rates in all three clusters. We then train machine learning models to
estimate the occurrence of system-level I/O bursts 5 to 120 minutes ahead.
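To illustrate the prediction setup, the following minimal sketch (not the
paper's actual pipeline) aggregates hypothetical per-job byte counts into
five-minute system-level rate windows, labels the highest-rate windows as
bursts, and trains a scikit-learn random forest on lagged rates to predict a
burst one window (five minutes) ahead. The synthetic data, the 95th-percentile
burst threshold, the one-hour lag window, and the model choice are all
assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Hypothetical per-job records (start_sec, duration_sec, bytes); in the
# paper these would come from parsed Darshan reports, not synthetic data.
rng = np.random.default_rng(0)
n_jobs, horizon_sec = 5000, 7 * 24 * 3600
starts = rng.uniform(0, horizon_sec, n_jobs)
durations = rng.exponential(600, n_jobs)
volumes = rng.lognormal(mean=20, sigma=2, size=n_jobs)

BIN = 300  # five-minute windows, as in the paper
n_bins = int(horizon_sec // BIN)
rates = np.zeros(n_bins)
for s, d, v in zip(starts, durations, volumes):
    # Spread each job's bytes uniformly over the windows it overlaps.
    b0 = int(s // BIN)
    b1 = min(int((s + d) // BIN), n_bins - 1)
    rates[b0:b1 + 1] += v / (b1 - b0 + 1)

# Label a window as a burst when its rate exceeds a high percentile;
# the actual burst threshold used in the paper is an assumption here.
burst = (rates > np.percentile(rates, 95)).astype(int)

# Features: the last 12 windows (one hour) of rates; target: burst
# occurrence `lead` windows later (lead=1 -> five minutes ahead).
LAGS, lead = 12, 1
X = np.stack([rates[i - LAGS + 1:i + 1] for i in range(LAGS - 1, n_bins - lead)])
y = burst[LAGS - 1 + lead:]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1 score:", f1_score(y_te, clf.predict(X_te)))
```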
Evaluation results show that we can predict I/O bursts with more than 90%
accuracy (F1 score) five minutes ahead and more than 87% accuracy two hours
ahead. We also show that the ML models attain more than 70% accuracy when
estimating the degree of the I/O burst. We believe that high-accuracy
predictions of I/O bursts can be used in multiple ways, such as postponing
delay-tolerant I/O operations (e.g., checkpointing), pausing nonessential
applications (e.g., file system scrubbers), and devising I/O-aware job
scheduling methods. To validate this claim, we simulated a burst-aware job
scheduler that can postpone the start time of applications to avoid I/O bursts.
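To make the scheduler idea concrete, here is a minimal sketch of the
postponement logic. The `predict_burst` callable stands in for a trained model
such as the one sketched above, and the delay budget and five-minute step are
illustrative assumptions, not the simulator's actual parameters.

```python
def burst_aware_start(predict_burst, now, max_delay, step=300):
    """Pick a start time for an I/O-heavy job: the first five-minute
    window within the delay budget for which the (hypothetical) model
    predict_burst(t) does not forecast a system-level I/O burst."""
    t = now
    while t < now + max_delay:
        if not predict_burst(t):
            return t          # found a burst-free window; start here
        t += step             # otherwise postpone by one window
    return now + max_delay    # delay budget exhausted; start anyway

# Example: a toy predictor that forecasts a burst for the next 10 minutes.
start = burst_aware_start(lambda t: t < 600, now=0, max_delay=3600)
print(start)  # 600 -> the job is postponed past the predicted burst
```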
We show that burst-aware job scheduling can reduce application runtime by up
to 5x.