Search CORE

3 research outputs found

Active job monitoring in pilots

Author: Fischer M.
Giffels M.
Jung C.
Kuehn E.
Petzold A.
Publication venue: IOP Publishing
Publication date: 08/02/2016
Field of study

Recent developments in high energy physics (HEP) including multi-core jobs and multi-core pilots require data centres to gain a deep understanding of the system to monitor, design, and upgrade computing clusters. Networking is a critical component. Especially the increased usage of data federations, for example in diskless computing centres or as a fall-back solution, relies on WAN connectivity and availability. The specific demands of different experiments and communities, but also the need for identification of misbehaving batch jobs, requires an active monitoring. Existing monitoring tools are not capable of measuring fine-grained information at batch job level. This complicates network-aware scheduling and optimisations. In addition, pilots add another layer of abstraction. They behave like batch systems themselves by managing and executing payloads of jobs internally. The number of real jobs being executed is unknown, as the original batch system has no access to internal information about the scheduling process inside the pilots. Therefore, the comparability of jobs and pilots for predicting run-time behaviour or network performance cannot be ensured. Hence, identifying the actual payload is important. At the GridKa Tier 1 centre a specific tool is in use that allows the monitoring of network traffc information at batch job level. This contribution presents the current monitoring approach and discusses recent e_orts and importance to identify pilots and their substructures inside the batch system. It will also show how to determine monitoring data of specific jobs from identified pilots. Finally, the approach is evaluated

KITopen

A scalable architecture for online anomaly detection of WLCG batch jobs

Author: Fischer M
Giffels M.
Jung C.
Kuehn E.
Petzold A.
Publication venue: IOP Publishing
Publication date: 01/10/2016
Field of study

For data centres it is increasingly important to monitor the network usage, and learn from network usage patterns. Especially con_guration issues or misbehaving batch jobs preventing a smooth operation need to be detected as early as possible. At the GridKa data and computing centre we therefore operate a tool BPNetMon for monitoring tra_c data and characteristics of WLCG batch jobs and pilots locally on di_erent worker nodes. On the one hand local information itself are not su_cient to detect anomalies for several reasons, e.g. the underlying job distribution on a single worker node might change or there might be a local miscon_guration. On the other hand a centralised anomaly detection approach does not scale regarding network communication as well as computational costs. We therefore propose a scalable architecture based on concepts of a super-peer network

Crossref

KITopen

Online Analysis of Dynamic Streaming Data

Author: Kühn Eileen
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2018
Field of study

Die Arbeit zum Thema "Online Analysis of Dynamic Streaming Data" beschäftigt sich mit der Distanzmessung dynamischer, semistrukturierter Daten in kontinuierlichen Datenströmen um Analysen auf diesen Datenstrukturen bereits zur Laufzeit zu ermöglichen. Hierzu wird eine Formalisierung zur Distanzberechnung für statische und dynamische Bäume eingeführt und durch eine explizite Betrachtung der Dynamik von Attributen einzelner Knoten der Bäume ergänzt. Die Echtzeitanalyse basierend auf der Distanzmessung wird durch ein dichte-basiertes Clustering ergänzt, um eine Anwendung des Clustering, einer Klassifikation, aber auch einer Anomalieerkennung zu demonstrieren. Die Ergebnisse dieser Arbeit basieren auf einer theoretischen Analyse der eingeführten Formalisierung von Distanzmessungen für dynamische Bäume. Diese Analysen werden unterlegt mit empirischen Messungen auf Basis von Monitoring-Daten von Batchjobs aus dem Batchsystem des GridKa Daten- und Rechenzentrums. Die Evaluation der vorgeschlagenen Formalisierung sowie der darauf aufbauenden Echtzeitanalysemethoden zeigen die Effizienz und Skalierbarkeit des Verfahrens. Zudem wird gezeigt, dass die Betrachtung von Attributen und Attribut-Statistiken von besonderer Bedeutung für die Qualität der Ergebnisse von Analysen dynamischer, semistrukturierter Daten ist. Außerdem zeigt die Evaluation, dass die Qualität der Ergebnisse durch eine unabhängige Kombination mehrerer Distanzen weiter verbessert werden kann. Insbesondere wird durch die Ergebnisse dieser Arbeit die Analyse sich über die Zeit verändernder Daten ermöglicht

KITopen