575 research outputs found

    AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

    Artificial Intelligence for IT Operations (AIOps) aims to combine the power of AI with the big data generated by IT operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There is a wide variety of problems to address and multiple use cases where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges, and opportunities, focusing specifically on the underlying AI techniques. We discuss in depth the key types of data emitted by IT operations activities, the scale of and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis, and automated actions. We discuss the problem formulation for each task and then present a taxonomy of techniques to solve these problems. We also identify relatively underexplored topics, especially those that could significantly benefit from advances in the AI literature, and we provide insights into the trends in this field and the key investment opportunities.
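
A toy illustration (not from the paper) of how the incident detection task can be formulated over operational metrics: a rolling z-score detector flags points that deviate strongly from recent history. The window size and threshold below are arbitrary placeholders.

```python
# Toy sketch: incident detection as anomaly scoring over a metric stream.
# Window size and threshold are illustrative choices, not the paper's.
import numpy as np

def rolling_zscore_incidents(series, window=60, threshold=4.0):
    """Flag indices whose value deviates strongly from the recent window."""
    series = np.asarray(series, dtype=float)
    incidents = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[t] - mu) / sigma > threshold:
            incidents.append(t)
    return incidents

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latency = rng.normal(100, 5, 1000)
    latency[700:705] += 80  # injected incident
    print(rolling_zscore_incidents(latency))
```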

    Classification in sparse, high dimensional environments applied to distributed systems failure prediction

    Network failures remain one of the main causes of distributed systems’ lack of reliability. To overcome this problem, we present an improvement over a failure prediction system based on elastic-net logistic regression and the application of rare-events prediction techniques, able to work with sparse, high-dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameters, and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when taking a proactive approach to failure management.
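
A minimal sketch of the core technique named above, elastic-net logistic regression on a sparse, high-dimensional dataset. The feature construction, hyperparameter values, and class labels are placeholders for illustration, not the paper's actual setup.

```python
# Minimal sketch: elastic-net logistic regression on sparse, high-dimensional
# data with a rare positive (failure) class. Values are placeholders.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = sparse_random(2000, 5000, density=0.01, random_state=42, format="csr")
y = (rng.random(2000) < 0.05).astype(int)  # rare failure events (~5%)

clf = LogisticRegression(
    penalty="elasticnet",
    solver="saga",            # the scikit-learn solver that supports elastic net
    l1_ratio=0.5,             # mix of L1 (variable selection) and L2 (stability)
    C=1.0,
    class_weight="balanced",  # crude compensation for the rare positive class
    max_iter=2000,
)
clf.fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```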

    System failure prediction through rare-events elastic-net logistic regression

    Predicting failures in a distributed system from previous events through logistic regression is a standard approach in the literature. This technique is not reliable, though, in two situations: in the prediction of rare events, which do not appear in sufficient proportion for the algorithm to capture them, and in environments with too many variables, where logistic regression tends to overfit, while manually selecting a subset of variables to build the model is error-prone. In this paper, we solve an industrial research case that presented this situation with a combination of elastic-net logistic regression, a method that automatically selects useful variables, a process of cross-validation on top of it, and the application of a rare-events prediction technique to reduce computation time. This process provides two layers of cross-validation that automatically obtain the optimal model complexity and the optimal model parameter values, while ensuring that even rare events are correctly predicted with a small number of training instances. We tested this method against real industrial data, obtaining 60 out of 80 possible models with a 90% average model accuracy.
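
A hedged sketch of the "two layers of cross-validation" idea described above: an inner cross-validation (scikit-learn's LogisticRegressionCV) selects the elastic-net regularisation, while an outer cross-validation estimates generalisation. The majority class is undersampled first as a generic rare-events trick to cut training cost; the paper's exact rare-events correction and dataset construction are not reproduced here, and the 5:1 ratio and grid values are assumptions.

```python
# Sketch of nested (two-layer) cross-validation around elastic-net logistic
# regression, with simple majority-class undersampling for rare events.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 200))
y = (rng.random(5000) < 0.02).astype(int)  # ~2% failure events

# Undersample the majority class to an assumed 5:1 ratio.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=5 * len(pos), replace=False)
idx = np.concatenate([pos, neg])
X_bal, y_bal = X[idx], y[idx]

inner_model = LogisticRegressionCV(
    Cs=10,
    l1_ratios=[0.1, 0.5, 0.9],
    penalty="elasticnet",
    solver="saga",
    cv=3,                 # inner layer: selects C and l1_ratio
    scoring="roc_auc",
    max_iter=5000,
)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner_model, X_bal, y_bal, cv=outer_cv, scoring="roc_auc")
print("outer-CV ROC AUC: %.3f ± %.3f" % (scores.mean(), scores.std()))
```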

    IoT Anomaly Detection Methods and Applications: A Survey

    Anomaly detection for the Internet of Things (IoT) is a rapidly expanding research field, and this growth necessitates an examination of application trends and current gaps. The vast majority of publications fall in areas such as network and infrastructure security, sensor monitoring, and smart home and smart city applications, and the field is extending into even more sectors. Recent advances have increased the need to study the many IoT anomaly detection applications. This paper begins with a summary of detection methods and applications, accompanied by a discussion of how IoT anomaly detection algorithms are categorized. We then review current publications to identify distinct application domains, examining papers chosen according to our search criteria. The survey considers 64 recent papers published between January 2019 and July 2021. In these publications we observed a shortage of IoT anomaly detection methodologies, for example when dealing with the integration of systems with various sensors, data and concept drift, and data augmentation where ground truth data are scarce. Finally, we discuss these challenges and offer new perspectives on where further research is required.
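
An illustrative sketch, not drawn from any surveyed paper, of one common unsupervised IoT anomaly detection approach (Isolation Forest) on simulated multi-sensor readings; unsupervised methods are relevant precisely when labelled ground truth is scarce, as noted above. Sensor semantics and the contamination rate are assumptions.

```python
# Unsupervised anomaly detection over simulated IoT sensor readings
# (temperature, humidity, pressure columns are assumed for illustration).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal = rng.normal(loc=[21.0, 45.0, 1013.0], scale=[0.5, 2.0, 1.0], size=(980, 3))
faulty = rng.normal(loc=[35.0, 10.0, 990.0], scale=[1.0, 1.0, 2.0], size=(20, 3))
readings = np.vstack([normal, faulty])

detector = IsolationForest(n_estimators=200, contamination=0.02, random_state=1)
labels = detector.fit_predict(readings)   # -1 = anomaly, 1 = normal
print("flagged readings:", np.flatnonzero(labels == -1))
```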

    The Terminator: an AI-based framework to handle dependability threats in large-scale distributed systems

    With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of the resources they possess, containing a large number of sophisticated hardware components that are tightly integrated. This scale and design complexity inherently contribute to sources of uncertainty, i.e., dependability threats that perturb the system during application execution. During execution, these systems generate a massive amount of log messages that capture the health status of their various components. Several previous works have leveraged such system logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems: (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC-system-independent, i.e., the technique can be used to detect errors in any HPC system. In addition, two novel self-supervised transformer neural networks are developed for failure prediction, obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine recasts the typical regression problem of LTTF as a novel multi-class classification problem, using a novel oversampling method for online time-based task training. Results from the datasets of six real-world HPC clusters show that our approaches significantly outperform the state-of-the-art methods on various metrics.
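
A hedged sketch of one idea mentioned above: recasting lead-time-to-failure (LTTF) regression as multi-class classification by bucketing lead times, then oversampling under-represented buckets. The bucket edges and the naive oversampling scheme below are illustrative stand-ins, not Time Machine's actual method.

```python
# Bucketing continuous lead times into ordinal classes, then naive random
# oversampling of minority buckets. Edges and scheme are assumptions.
import numpy as np

def lttf_to_class(lead_times_s, edges=(60, 300, 900, 3600)):
    """Map continuous lead times (seconds) to ordinal class labels 0..len(edges)."""
    return np.digitize(lead_times_s, bins=edges)

def random_oversample(X, y, seed=0):
    """Naively duplicate minority-class samples up to the majority-class count."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    target = counts.max()
    idx = []
    for cls, n in enumerate(counts):
        if n == 0:
            continue  # skip buckets with no samples
        cls_idx = np.flatnonzero(y == cls)
        idx.append(rng.choice(cls_idx, size=target, replace=True))
    return X[np.concatenate(idx)], y[np.concatenate(idx)]

lead_times = np.array([30, 120, 400, 1000, 5000, 45, 7200])  # seconds to failure
y = lttf_to_class(lead_times)                                # -> [0 1 2 3 4 0 4]
X = np.arange(len(y)).reshape(-1, 1)                         # stand-in features
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))                                    # classes now equally represented
```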

    Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems

    System failures are expected to be frequent in the exascale era, just as they are in current petascale systems. The health of such systems is usually determined through challenging analysis of large amounts of unstructured and redundant log data. In this paper, we leverage log data and propose Clairvoyant, a novel self-supervised (i.e., no labels needed) model that predicts node failures in HPC systems based on a recent deep learning approach, the transformer-decoder with its self-attention mechanism. Clairvoyant predicts node failures by (i) predicting a sequence of log events and then (ii) identifying whether a failure is part of that sequence. We carefully evaluate Clairvoyant against another state-of-the-art failure prediction approach, Desh, on two real-world system log datasets. Experiments show that Clairvoyant is significantly better: e.g., it predicts node failures with average Bleu, Rouge, and MCC scores of 0.90, 0.78, and 0.65 respectively, while Desh scores only 0.58, 0.58, and 0.25. More importantly, this improvement is achieved with faster training and prediction times, Clairvoyant being about 25× and 15× faster than Desh, respectively.
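
A deliberately tiny decoder-only (GPT-style) stand-in for the general idea of predicting the next log-event ID from a sequence of past events. This is a sketch of the technique class, not Clairvoyant itself: the vocabulary size, model dimensions, and training step are placeholders.

```python
# Minimal causal transformer over log-event IDs: predict the next event at
# each position. All hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class NextLogEventModel(nn.Module):
    def __init__(self, vocab_size=500, d_model=64, n_heads=4, n_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                      # x: (batch, seq) of event IDs
        seq = x.size(1)
        pos = torch.arange(seq, device=x.device)
        h = self.tok(x) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq).to(x.device)
        h = self.blocks(h, mask=causal)        # attention restricted to past events
        return self.head(h)                    # logits over the next event at each step

model = NextLogEventModel()
events = torch.randint(0, 500, (8, 32))        # 8 sequences of 32 log-event IDs
logits = model(events[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 500), events[:, 1:].reshape(-1))
print(loss.item())
```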

    A Cognitive Framework to Secure Smart Cities

    Advances in technology have transformed cyber-physical systems and their interface with the IoT into a more sophisticated and challenging paradigm. As a result, vulnerabilities and potential attacks manifest themselves considerably more than before, forcing researchers to rethink the conventional strategies currently in place to secure such physical systems. This manuscript studies the complex interweaving of sensor networks and physical systems and suggests a foundational innovation in the field. In sharp contrast with existing IDS and IPS solutions, a preventive and proactive method is employed to stay ahead of attacks by constantly monitoring network data patterns and identifying imminent threats. By capitalizing on the significant progress in the processing power (e.g., petascale computing) and storage capacity of computer systems, we propose a deep learning approach to predict and identify various security breaches before they occur. The learning process collects a large number of files of different types and runs tests on them to classify each as benign or malicious; the resulting prediction model can then be used to identify attacks. Our project articulates a new framework for interactions between physical systems and sensor networks, in which malicious packets are repeatedly learned over time while the system continues to operate under imperfect security mechanisms.
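
A hedged sketch of the benign/malicious file classification step described above: a small feed-forward network over per-file byte histograms. The feature choice (256-bin byte frequency) and the architecture are assumptions made for illustration, not the framework's actual design.

```python
# Toy benign/malicious classifier over byte-histogram features; the feature
# representation, network size, and labels are illustrative assumptions.
import torch
import torch.nn as nn

def byte_histogram(data: bytes) -> torch.Tensor:
    """Normalised 256-bin frequency of byte values in a file."""
    counts = torch.zeros(256)
    for b in data:
        counts[b] += 1
    return counts / max(len(data), 1)

classifier = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 2),            # logits: [benign, malicious]
)

# Stand-in training batch: random "files" with synthetic labels.
files = [bytes(torch.randint(0, 256, (1024,)).tolist()) for _ in range(16)]
X = torch.stack([byte_histogram(f) for f in files])
y = torch.randint(0, 2, (16,))

loss = nn.functional.cross_entropy(classifier(X), y)
loss.backward()                    # gradients for one illustrative training step
print(float(loss))
```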