A pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs

Abstract

Time series are key across industrial and research areas because they model behaviour over time, which makes them well suited to use cases such as event monitoring, trend prediction and anomaly detection. Their relevance keeps growing as monitoring capabilities increase in many areas and generate massive amounts of data. Time series also hold potential for Machine Learning processing, often fused with Big Data, to search for useful information and solve real-world problems. Time series can be studied individually, each representing a single entity or variable to be analysed, or as a group, to study and represent a more complex entity or scenario. The latter case involves multivariate time series, which usually require different approaches. In this paper, we present a pipeline architecture to process and cluster multiple groups of multivariate time series. To implement it, we apply a multi-process solution composed of a feature-based extraction stage, followed by a dimension reduction stage and, finally, several clustering algorithms. The pipeline is also highly configurable in terms of the techniques used at each stage, which allows searching over several combinations for the most promising results. The pipeline has been experimentally applied to batches of HPC jobs from different users of a supercomputer, with the multivariate time series coming from the monitoring of several node resource metrics. The results show that this multi-process information fusion can create different meaningful clusters from the batches using only the time series, without any labelling information, which makes this an unsupervised scenario. Optionally, the pipeline also supports an outlier detection stage to find and separate jobs that are radically different from the others in a dataset.
These outliers can be removed for a better clustering and later reviewed for anomalies or, if numerous, fed back to the pipeline to identify possible groupings. The results also include some outliers found in the experiments, as well as scenarios where they are clustered, or ignored and not removed at all. In addition, by leveraging Big Data technologies such as Spark, the pipeline is shown to be scalable, working with up to hundreds of jobs and thousands of time series.

This research was funded by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00/AEI/10.13039/501100011033), and by Xunta de Galicia, Spain and FEDER funds of the European Union (Centro de Investigación de Galicia accreditation 2019–2022, ref. ED431G 2019/01; Consolidation Program of Competitive Reference Groups, ref. ED431C 2021/30). Funding for open access charge: Universidade da Coruña/CISUG.
