Dependency-aware job scheduling in clusters is NP-hard. Recent work shows
that Deep Reinforcement Learning (DRL) can solve it effectively. However,
even though a DRL-based policy achieves remarkable performance gains, it is
difficult for administrators to understand, so such a complex model-based
scheduler struggles to gain trust in systems where simplicity is favored.
In this paper, we present a multi-level explanation framework to interpret
the policy of a DRL-based scheduler. We dissect its decision-making process
into the job level and the task level, and approximate each level with
interpretable models and
rules that align with operational practices. We show that the framework gives
system administrators insight into a state-of-the-art scheduler and
reveals a robustness issue in its behavior pattern.

Comment: Accepted in the MLSys'22 Workshop on Cloud Intelligence / AIOp