Active job monitoring in pilots

Abstract

Recent developments in high energy physics (HEP) including multi-core jobs and multi-core pilots require data centres to gain a deep understanding of the system to monitor, design, and upgrade computing clusters. Networking is a critical component. Especially the increased usage of data federations, for example in diskless computing centres or as a fall-back solution, relies on WAN connectivity and availability. The specific demands of different experiments and communities, but also the need for identification of misbehaving batch jobs, requires an active monitoring. Existing monitoring tools are not capable of measuring fine-grained information at batch job level. This complicates network-aware scheduling and optimisations. In addition, pilots add another layer of abstraction. They behave like batch systems themselves by managing and executing payloads of jobs internally. The number of real jobs being executed is unknown, as the original batch system has no access to internal information about the scheduling process inside the pilots. Therefore, the comparability of jobs and pilots for predicting run-time behaviour or network performance cannot be ensured. Hence, identifying the actual payload is important. At the GridKa Tier 1 centre a specific tool is in use that allows the monitoring of network traffc information at batch job level. This contribution presents the current monitoring approach and discusses recent e_orts and importance to identify pilots and their substructures inside the batch system. It will also show how to determine monitoring data of specific jobs from identified pilots. Finally, the approach is evaluated

    Similar works