Given the cost of HPC clusters, making the best use of them is crucial to
improving infrastructure ROI. Likewise, reducing failed HPC jobs and the related
waste in user wait time is crucial to improving HPC user productivity (aka human
ROI). While most efforts (e.g., debugging HPC programs) explore technical
aspects to improve the ROI of HPC clusters, we hypothesize that non-technical
(human) aspects are worth exploring to make non-trivial ROI gains; specifically,
understanding non-technical aspects and how they contribute to the failure of
HPC jobs.
In this regard, we conducted a case study in the context of the Beocat cluster at
Kansas State University. The purpose of the study was to learn the reasons why
users terminate jobs and to quantify wasted computations in such jobs in terms
of system utilization and user wait time. The data from the case study helped
identify interesting and actionable reasons why users terminate HPC jobs. It
also helped confirm that user-terminated jobs may be associated with a
non-trivial amount of wasted computation, which, if reduced, can help improve the
ROI of HPC clusters.