575 research outputs found

AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

Author: Cheng Qian
Hoi Steven C. H.
Liu Chenghao
Saha Amrita
Sahoo Doyen
Savarese Silvio
Singh Manpreet
Woo Gerald
Yang Wenzhuo
Publication date: 10/04/2023

Artificial Intelligence for IT Operations (AIOps) aims to combine the power of AI with the big data generated by IT operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There is a wide variety of problems to address, and multiple use cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively underexplored topics, especially those that could significantly benefit from advances in the AI literature. We also provide insights into the trends in this field and the key investment opportunities.

arXiv.org e-Print Archive

Classification in sparse, high dimensional environments applied to distributed systems failure prediction

Author: A.S. Tanenbaum
B. Schroeder
F. Salfner
G. King
H. Zou
M. Gallet
N. Trendafilov
W. Ahmed
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Network failures are still one of the main causes of distributed systems’ lack of reliability. To overcome this problem, we present an improvement over a failure prediction system, based on Elastic Net Logistic Regression and the application of rare events prediction techniques, that is able to work with sparse, high-dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter, and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when taking a proactive approach to failure management.

Crossref

Archivo Digital UPM

System failure prediction through rare-events elastic-net logistic regression

Author: Dueñas López Juan Carlos
Navarro González José Manuel
Parada Gélvez Hugo Alexer
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014

Predicting failures in a distributed system from previous events through logistic regression is a standard approach in the literature. This technique is not reliable, though, in two situations: in the prediction of rare events, which do not appear in sufficient proportion for the algorithm to capture, and in environments with too many variables, where logistic regression tends to overfit and manually selecting a subset of variables to build the model is error-prone. In this paper, we solve an industrial research case that presented this situation with a combination of elastic net logistic regression, a method that allows us to automatically select useful variables; a process of cross-validation on top of it; and the application of a rare events prediction technique to reduce computation time. This process provides two layers of cross-validation that automatically obtain the optimal model complexity and the optimal model parameter values, while ensuring that even rare events will be correctly predicted with a low number of training instances. We tested this method against real industrial data, obtaining a total of 60 out of 80 possible models with a 90% average model accuracy.
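
As a rough illustration of the elastic net idea mentioned in this abstract, the sketch below evaluates a logistic log loss with an elastic net penalty added. It is not the authors' implementation: the function names, the `lam` and `alpha` parameters, and the glmnet-style weighting of the L1 and L2 terms are assumptions chosen for the example. In practice `lam` and `alpha` would be chosen by cross-validation, which the sketch does not perform.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) / 2 * L2^2); an assumed convention."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def penalized_log_loss(weights, bias, X, y, lam, alpha):
    """Mean logistic loss over the rows of X plus the elastic net penalty."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
        p = min(max(p, eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(y) + elastic_net_penalty(weights, lam, alpha)
```

With `alpha = 1` the penalty reduces to the lasso term, which can drive individual weights to exactly zero and so acts as automatic variable selection; with `alpha = 0` it reduces to the ridge term.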

Archivo Digital UPM

IoT Anomaly Detection Methods and Applications: A Survey

Author: Ahmed Bestoun S.
Chatterjee Ayan
Publication venue: 'Elsevier BV'
Publication date: 01/01/2022

Research on anomaly detection for the Internet of Things (IoT) is a rapidly expanding field. This growth necessitates an examination of application trends and current gaps. The vast majority of these publications are in areas such as network and infrastructure security, sensor monitoring, smart home, and smart city applications, and are extending into even more sectors. Recent advancements in the field have increased the need to study the many IoT anomaly detection applications. This paper begins with a summary of the detection methods and applications, accompanied by a discussion of the categorization of IoT anomaly detection algorithms. We then discuss the current publications to identify distinct application domains, examining papers chosen based on our search criteria. The survey considers 64 papers published between January 2019 and July 2021. In recent publications, we observed a shortage of IoT anomaly detection methodologies, for example, when dealing with the integration of systems with various sensors, data and concept drifts, and data augmentation where there is a shortage of ground truth data. Finally, we discuss these challenges and offer new perspectives where further research is required.

arXiv.org e-Print Archive

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Publikationer från Karlstads Universitet

Performance Anomaly Detection and Bottleneck Identification

Author: Alpaydin E.
Barham Paul
Berkhin Pavel
Bodík Peter
Brey Jack
Burke Shaun
Chung Hsin
Cohen Ira
Dean Daniel J.
Fodor Imola K.
Frank
Fu Song
Fu Song
Gregg Brendan
Guan Qiang
Gunther Neil J.
Huang Su-Yun
Igor
Jeffrey
John
Kang Hui
Kelly Terence
Kotsiantis S. B.
Lee Han Bok
Lee Wenke
Lilja David J.
Malkowski Simon
McHugh Andrew
Oakland John S.
Panourgias Iakovos
Reiss Charles
Reynolds Douglas
Sambasivan Raja R.
Shallahamer Craig A.
Shende Sameer
Tan Yongmin
Tarby Jean-Claude
Trubin Igor
Wang Chengwei
Wang Haichuan
Wang Tao
Wilder John
Yu Minlan
Zhang Qi
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

The terminator : an AI-based framework to handle dependability threats in large-scale distributed systems

Author: Alharthi Khalid Ayed

With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of the resources they possess, containing a large number of sophisticated HW components that are tightly integrated. This scale and design complexity inherently contribute to sources of uncertainties, i.e., there are dependability threats that perturb the system during application execution. During system execution, these HPC systems generate a massive amount of log messages that capture the health status of the various components. Several previous works have leveraged those systems’ logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems, those of (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. On the other hand, two novel self-supervised transformer neural networks are developed for failure prediction, thereby obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second technique, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine addresses the typical regression problem of LTTF as a novel multi-class classification problem, using a novel oversampling method for online time-based task training. Results from six real-world HPC clusters’ datasets show that our approaches significantly outperform the state-of-the-art methods on various metrics.
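
The recasting of lead time to failure from a regression target into a multi-class label can be pictured with a simple binning step. The sketch below is only an illustration under stated assumptions: the bin edges in `LEAD_TIME_EDGES` and the function name are hypothetical, and the abstract does not say how Time Machine defines its classes or performs its oversampling.

```python
from bisect import bisect_right

# Hypothetical bin edges in seconds (1 min, 5 min, 15 min, 1 h).
# These values are assumptions for illustration, not taken from the thesis.
LEAD_TIME_EDGES = [60, 300, 900, 3600]


def lead_time_to_class(lead_time_seconds, edges=LEAD_TIME_EDGES):
    """Map a continuous lead time to a class index in 0..len(edges).

    Class 0 covers [0, edges[0]), class 1 covers [edges[0], edges[1]),
    and the last class covers everything at or beyond the final edge.
    """
    if lead_time_seconds < 0:
        raise ValueError("lead time must be non-negative")
    return bisect_right(edges, lead_time_seconds)
```

A classifier trained on such labels predicts an interval for the lead time instead of a point estimate, which is one plausible reading of "regression as multi-class classification".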

Warwick Research Archives Portal Repository

Clairvoyant : a log-based transformer-decoder for failure prediction in large-scale systems

Author: Alharthi Khalid
Cappello Franck
Jhumka Arshad
Sheng Di
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 28/06/2022

System failures are expected to be frequent in the exascale era, as they are in current petascale systems. The health of such systems is usually determined from challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model to predict node failures in HPC systems based on a recent deep learning approach called transformer-decoder and the self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying if a failure is a part of that sequence. We carefully evaluate Clairvoyant and another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it can predict node failures with average BLEU, ROUGE, and MCC scores of 0.90, 0.78, and 0.65 respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction times, with Clairvoyant being about 25× and 15× faster than Desh respectively.
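
Step (ii) of the procedure described in this abstract, together with the MCC metric it reports, can be sketched in a few lines. This is a minimal illustration, not the Clairvoyant code: the event names are hypothetical, the sequence model of step (i) is not implemented and is replaced by a hard-coded predicted sequence, and the MCC formula is the standard one computed from confusion-matrix counts.

```python
import math


def contains_failure(predicted_events, failure_events):
    """Step (ii): flag a failure if any predicted event is a known failure event."""
    return any(event in failure_events for event in predicted_events)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical event identifiers, standing in for the output of step (i).
predicted = ["mem_warning", "link_retry", "node_down"]
failures = {"node_down", "kernel_panic"}
will_fail = contains_failure(predicted, failures)
```

MCC ranges from -1 to 1 and stays informative when failures are rare, which is presumably why it is reported alongside the sequence-similarity scores.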

Warwick Research Archives Portal Repository

A Cognitive Framework to Secure Smart Cities

Author: Latifi Shahram
Pirouz Matin
Raste Neha
Tayeb Shahab
Publication venue: Digital Scholarship@UNLV
Publication date: 26/09/2018

Advances in technology have transformed Cyber Physical Systems and their interface with IoT into a more sophisticated and challenging paradigm. As a result, vulnerabilities and potential attacks manifest themselves considerably more than before, forcing researchers to rethink the conventional strategies that are currently in place to secure such physical systems. This manuscript studies the complex interweaving of sensor networks and physical systems and suggests a foundational innovation in the field. In sharp contrast with the existing IDS and IPS solutions, in this paper, a preventive and proactive method is employed to stay ahead of attacks by constantly monitoring network data patterns and identifying threats that are imminent. Here, by capitalizing on the significant progress in processing power (e.g. petascale computing) and storage capacity of computer systems, we propose a deep learning approach to predict and identify various security breaches that are about to occur. The learning process takes place by collecting a large number of files of different types and running tests on them to classify them as benign or malicious. The resulting prediction model can then be used to identify attacks. Our project articulates a new framework for interactions between physical systems and sensor networks, where malicious packets are repeatedly learned over time while the system continually operates with respect to imperfect security mechanisms.

EDP Sciences OAI-PMH repository (1.2.0)

University of Nevada, Las Vegas Repository