End-to-end anomaly detection system in the CERN OpenStack Cloud infrastructure
The CERN OpenStack private Data Center offers Cloud resources, services and tools to a scientific community of about 3,000 CERN users. In the Infrastructure, about 14,000 Virtual Machines are deployed, covering several use cases, from web front-ends to databases and analytics platforms. Cloud service managers have to make sure that the desired computational power is delivered to all users, and to accomplish this task, spotting anomalous server machines in time is crucial. The previously adopted solution consisted of monitoring the performance metrics of the machines with a threshold-based alarming system. In this thesis, we present the new Anomaly Detection (AD) system that currently runs in the CERN Cloud Infrastructure. On the aforementioned multivariate time series metrics, we run three different unsupervised Machine Learning models: Isolation Forest, LSTM-AutoEncoder and GRU-AutoEncoder. Then, using an ensemble approach, we automatically report to the CERN Cloud managers, each day, the most anomalous servers of the previous day. We describe the end-to-end pipeline going from the data sources to the detected anomalies, the architecture of the system, the pre-processing steps implemented, and the design choices behind our solution. Furthermore, we present a new labelled evaluation dataset for the CERN Cloud case study and report, on this dataset, the results of our experiments comparing the three models used in the system. In particular, in terms of AUC-ROC, we show that the three adopted models, despite their very different nature, all achieve high performance (AUC-ROC > 0.95) and all outperform the previous threshold-based system in terms of true positive rate, at the false positive rate required by the Data Center's operators. In addition, we compare the time performance of the models and show that training is robust to the selection and size of the training data.
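As a rough illustration of the unsupervised scoring idea described in this thesis, the sketch below ranks servers by an Isolation Forest anomaly score computed from aggregated performance metrics. The DataFrame layout, metric names and the aggregation are assumptions made for the example only; the actual system also combines LSTM and GRU AutoEncoders in its ensemble.

```python
# Minimal sketch of unsupervised per-server anomaly ranking with Isolation Forest.
# Column names and the per-server aggregation are illustrative assumptions,
# not the actual CERN pipeline.
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def rank_anomalous_servers(metrics: pd.DataFrame, top_k: int = 10) -> pd.Series:
    """metrics: one row per (server, timestamp) with a 'server' column and
    numeric performance metrics (e.g. cpu_load, memory_used, disk_io)."""
    # Collapse each server's multivariate time series into per-server features.
    numeric = metrics.select_dtypes("number")
    features = numeric.groupby(metrics["server"]).agg(["mean", "std", "max"])
    features.columns = ["_".join(col) for col in features.columns]
    features = features.fillna(0.0)

    scaled = StandardScaler().fit_transform(features)

    # Isolation Forest: lower score_samples() means more anomalous,
    # so negate it to get an anomaly score.
    iso = IsolationForest(n_estimators=200, random_state=0).fit(scaled)
    scores = -iso.score_samples(scaled)

    # Return the top_k most anomalous servers of the analysed period.
    return pd.Series(scores, index=features.index).nlargest(top_k)
```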
Anomaly detection in the CERN cloud infrastructure (25th International Conference on Computing in High Energy & Nuclear Physics)
Anomaly detection in the CERN OpenStack cloud is a challenging task due to the large scale of the computing infrastructure and, consequently, the large volume of monitoring data to analyse. The current solution to spot anomalous servers in the cloud infrastructure relies on a threshold-based alarming system carefully set by the system managers on the performance metrics of each component of the infrastructure. This contribution explores fully automated, unsupervised machine learning solutions for anomaly detection on time series metrics, adapting both traditional and deep learning approaches. The paper describes a novel end-to-end data analytics pipeline implemented to digest the large amount of monitoring data and to expose anomalies to the system managers. The pipeline relies solely on open-source tools and frameworks, such as Apache Spark, Apache Airflow, Kubernetes, Grafana and Elasticsearch. In addition, an approach to build annotated datasets from the CERN cloud monitoring data is reported. Finally, the preliminary performance of a number of anomaly detection algorithms is evaluated using the aforementioned annotated datasets.
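The abstract mentions Apache Airflow among the orchestration tools; the sketch below shows how a daily detection run could, in principle, be scheduled as an Airflow DAG. The DAG id, task names and callables are hypothetical placeholders, not the pipeline described in the paper.

```python
# Hedged sketch: orchestrating a daily anomaly-detection run with Apache Airflow.
# All identifiers below (DAG id, task ids, callables) are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_metrics(**context):
    """Pull the previous day's monitoring metrics (e.g. from Elasticsearch/HDFS)."""
    ...

def score_servers(**context):
    """Run the unsupervised models and compute per-server anomaly scores."""
    ...

def publish_anomalies(**context):
    """Expose the most anomalous servers to the system managers (e.g. dashboards)."""
    ...

with DAG(
    dag_id="cloud_anomaly_detection",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
) as dag:
    extract = PythonOperator(task_id="extract_metrics", python_callable=extract_metrics)
    score = PythonOperator(task_id="score_servers", python_callable=score_servers)
    publish = PythonOperator(task_id="publish_anomalies", python_callable=publish_anomalies)

    # Daily flow: raw monitoring data -> anomaly scores -> report to operators.
    extract >> score >> publish
```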
Lightweight and Scalable Model for Tweet Engagements Predictions in a Resource-constrained Environment
In this paper we provide an overview of the approach we used as team Trial&Error for the ACM RecSys Challenge 2021. The competition, organized by Twitter, addresses the problem of predicting different categories of user engagements (Like, Reply, Retweet and Retweet with Comment), given a dataset of previous interactions on the Twitter platform. Our proposed method relies on efficiently leveraging the massive amount of data, crafting a wide variety of features and designing a lightweight solution. This results in a significant reduction of computational resource requirements, during both the training and inference phases. The final model, an optimized LightGBM, allowed our team to reach 4th position in the final leaderboard and to rank 1st among the academic teams.
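As an illustration of the kind of model described above, the sketch below trains a LightGBM classifier for a single engagement target. The column names, feature set and hyperparameters are assumptions made for the example, not the team's actual configuration.

```python
# Minimal sketch of a LightGBM engagement classifier (one binary target at a time).
# Column names and hyperparameters are illustrative assumptions.
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

ENGAGEMENTS = ["like", "reply", "retweet", "retweet_with_comment"]  # assumed labels

def train_engagement_model(interactions: pd.DataFrame, target: str = "like"):
    """interactions: one row per (tweet, user) pair with numeric features and
    one binary column per engagement type."""
    features = interactions.drop(columns=ENGAGEMENTS)
    X_train, X_val, y_train, y_val = train_test_split(
        features, interactions[target], test_size=0.1, random_state=0
    )

    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=63,
    )
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
    )

    # Report validation average precision for the chosen engagement type.
    preds = model.predict_proba(X_val)[:, 1]
    print(f"{target} validation average precision:", average_precision_score(y_val, preds))
    return model
```

In practice one such model would be trained per engagement category, which keeps each model small and the overall resource footprint low, in line with the lightweight design the abstract emphasises.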