
    Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction

    With the rapid growth of cloud computing, a variety of software services have been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on failure instance (disk, node, switch, etc.) prediction. Once the prediction output is positive, mitigation actions are taken to rapidly resolve the underlying failure. In our real-world practice at Microsoft Azure, we find that prediction accuracy may decrease by about 9% after retraining the models. Because mitigated instances cannot be verified after mitigation, they become uncertain positive instances, which introduce noise when the prediction model is updated. To the best of our knowledge, we are the first to identify this Uncertain Positive Learning (UPLearning) issue in the real-world cloud failure prediction scenario. To tackle this problem, we design an Uncertain Positive Learning Risk Estimator (Uptake) approach. Using two real-world disk failure prediction datasets and conducting node prediction experiments in Microsoft Azure, a top-tier cloud provider that serves millions of users, we demonstrate that Uptake can significantly improve failure prediction accuracy by 5% on average.
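
    The abstract does not include code; as a rough illustration of the underlying idea, here is a minimal sketch, assuming a simple scikit-learn classifier and an invented down-weighting scheme (the function name and the 0.3 weight are hypothetical, not Uptake's actual risk estimator), of how unverifiable positives might be down-weighted during retraining:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_with_uncertain_positives(X, y, uncertain,
                                     base_weight=1.0, uncertain_weight=0.3):
    """Retrain a failure predictor, down-weighting unverifiable positives.

    `uncertain` marks positives that were mitigated before the failure
    could be verified; the 0.3 weight is an arbitrary illustrative choice.
    """
    # Verified instances keep full weight; uncertain positives contribute
    # less to the loss, dampening the label noise introduced by mitigation.
    weights = np.where(uncertain, uncertain_weight, base_weight)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)
    return model

# Synthetic example: 100 samples, 10 features, 20 positives, half unverified.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = np.zeros(100, dtype=int)
y[:20] = 1
uncertain = np.zeros(100, dtype=bool)
uncertain[:10] = True
model = retrain_with_uncertain_positives(X, y, uncertain)
```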

    Whose Fault is It? Correctly Attributing Outages in Cloud Services

    Cloud availability is a major performance parameter in cloud Service Level Agreements (SLAs). Its correct evaluation is essential to SLA enforcement and possible litigation issues. Current methods fail to correctly identify the fault location, since they include the network's contribution. We propose a procedure to identify the failures actually due to the cloud itself and provide a correct cloud availability measure. The procedure employs freely available tools, i.e., traceroute and whois, and arrives at the availability measure by first identifying the boundaries of the cloud. We evaluate our procedure by testing it on three major cloud providers: Google Cloud, Amazon AWS, and Rackspace. The results show that the procedure arrives at a correct identification in 95% of cases. The cloud availability obtained in the test after correct identification lies between 3 and 4 nines for the three platforms under test.
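
    A minimal sketch of the boundary-identification step, assuming the traceroute and whois command-line tools are installed and using deliberately simplified output parsing (real output formats vary by platform, and the provider keyword list is illustrative):

```python
import re
import subprocess

def first_cloud_hop(target, provider_keywords=("GOOGLE", "AMAZON", "RACKSPACE")):
    """Walk the traceroute path and return the first hop registered to the
    cloud provider; failures at or past this hop can then be attributed to
    the cloud itself rather than to the intervening network."""
    trace = subprocess.run(["traceroute", "-n", target],
                           capture_output=True, text=True).stdout
    # Grab the first IPv4 address on each hop line (parsing is simplified).
    hops = re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", trace, re.MULTILINE)
    for hop_number, ip in enumerate(hops, start=1):
        info = subprocess.run(["whois", ip],
                              capture_output=True, text=True).stdout.upper()
        if any(keyword in info for keyword in provider_keywords):
            return hop_number, ip  # estimated cloud boundary
    return None

print(first_cloud_hop("storage.googleapis.com"))
```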

    Studying the Characteristics of AIOps Projects on GitHub

    Artificial Intelligence for IT Operations (AIOps) leverages AI approaches to handle the massive data generated during the operations of software systems. Prior works have proposed various AIOps solutions to support different tasks in system operations and maintenance (e.g., anomaly detection). In this work, we investigate open-source AIOps projects in depth to understand the characteristics of AIOps in practice. We first carefully identify a set of AIOps projects from GitHub and analyze their repository metrics (e.g., the programming languages used). Then, we qualitatively study the projects to understand their input data, analysis techniques, and goals. Finally, we analyze the quality of these projects using different quality metrics, such as the number of bugs. We also sample two sets of baseline projects from GitHub: a random sample of machine learning projects and a random sample of general-purpose projects. We compare different metrics of our identified AIOps projects with these baselines. Our results show a recent and growing interest in AIOps solutions. However, the quality metrics indicate that AIOps projects suffer from more issues than our baseline projects. We also pinpoint the most common issues in AIOps approaches and discuss possible solutions to overcome them. Our findings help practitioners and researchers understand the current state of AIOps practices and shed light on different ways to improve their weak aspects. To the best of our knowledge, this work is the first to characterize open-source AIOps projects.
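
    As a hedged illustration of the repository-metric collection step (not the authors' tooling; the repository name below is a placeholder), such metrics can be gathered through the public GitHub REST API:

```python
import requests

def repo_metrics(full_name, token=None):
    """Fetch basic repository metrics for one GitHub project."""
    headers = {"Authorization": f"token {token}"} if token else {}
    response = requests.get(f"https://api.github.com/repos/{full_name}",
                            headers=headers, timeout=10)
    response.raise_for_status()
    data = response.json()
    return {
        "language": data.get("language"),        # main programming language
        "stars": data.get("stargazers_count"),   # popularity proxy
        "open_issues": data.get("open_issues_count"),
        "created_at": data.get("created_at"),
    }

# Usage with a hypothetical repository name:
# print(repo_metrics("owner/aiops-project"))
```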

    Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

    Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.
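
    A minimal sketch of the idea, under the simplifying assumption that each experiment's trace is a flat sequence of event names (the traces below are invented): vectorize the traces, then cluster them so recurring failure modes group together and outliers surface for manual review:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented execution traces, one per fault injection experiment.
traces = [
    "api_error timeout volume_detach_fail",
    "api_error timeout volume_detach_fail",
    "instance_spawn_fail scheduler_retry",
    "db_conn_lost api_error",
]

# Vectorize each trace, then cluster with DBSCAN; points labeled -1 are
# outliers, i.e., candidate failure modes that deserve manual inspection.
X = TfidfVectorizer().fit_transform(traces)
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)
for trace, label in zip(traces, labels):
    print(label, trace)
```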

    An Empirical Study of Runtime Files Attached to Crash Reports

    When a software system crashes, users report the crash using crash report tracking tools. A crash report (CR) is then routed to software developers for review so the problem can be fixed. A CR contains a wealth of information that allows developers to diagnose the root causes of the problem and provide fixes. This is particularly important at Ericsson, one of the world's largest telecom companies, where this study was conducted. The handling of CRs at Ericsson goes through multiple lines of support until a solution is provided. To make this possible, Ericsson software engineers and network operators rely on runtime data collected during the crash. This data is organized into files that are attached to the CRs. However, not all CRs contain this data in the first place. Software engineers and network operators often have to request additional files after the CR is created and sent to the different Ericsson support lines, a problem that often delays the resolution process. In this thesis, we conduct an empirical study of the runtime files attached to Ericsson CRs. We focus on answering four research questions that revolve around the proportion of runtime files in a selected set of CRs, the relationship between the severity of CRs and the types of files they contain, the impact of different file types on the time to fix a CR, and the possibility of predicting, at submission time, whether a CR should have runtime data attached to it. Our ultimate goal is to understand how runtime data is used during the CR handling process at Ericsson and what recommendations we can make to improve this process.
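
    For the last research question, a hedged sketch of what such a submission-time predictor could look like, using invented placeholder reports and labels (the thesis does not publish its model or data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder crash-report summaries and labels: 1 means the
# report turned out to need runtime files attached, 0 means it did not.
reports = [
    "node restarted unexpectedly during traffic peak",
    "UI typo in settings menu",
    "kernel panic on board 3, logs missing",
    "documentation link broken in help page",
]
needs_runtime_files = [1, 0, 1, 0]

# Text classifier fitted on report summaries available at submission time.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reports, needs_runtime_files)
print(clf.predict(["switch crashed during failover"]))
```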

    Leveraging data-driven infrastructure management to facilitate AIOps for big data applications and operations

    As institutions increasingly shift to distributed and containerized application deployments on remote heterogeneous cloud/cluster infrastructures, the cost and difficulty of efficiently managing and maintaining data-intensive applications have risen. A new emerging solution to this issue is Data-Driven Infrastructure Management (DDIM), where decisions regarding the management of resources are taken based on data aspects and operations (both on the infrastructure and on the application level). This chapter introduces readers to the core concepts underpinning DDIM, based on experience gained from the development of the Kubernetes-based BigDataStack DDIM platform (https://bigdatastack.eu/). The chapter involves multiple important Big Data Value (BDV) topics, including development, deployment, and operations for cluster/cloud-based big data applications, as well as data-driven analytics and artificial intelligence for smart automated infrastructure self-management. Readers will gain important insights into how next-generation DDIM platforms function, as well as how they can be used in practical deployments to improve quality of service for big data applications. The chapter relates to the technical priority Data Processing Architectures of the European Big Data Value Strategic Research & Innovation Agenda [33], as well as to the Data Processing Architectures horizontal and the Engineering and DevOps for building Big Data Value vertical concerns. It also relates to the Reasoning and Decision Making cross-sectorial technology enablers of the AI, Data and Robotics Strategic Research, Innovation & Deployment Agenda [34].
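
    As a toy sketch of the kind of data-driven decision a DDIM platform automates (the function, numbers, and scaling rule are invented for illustration and are not BigDataStack's actual logic):

```python
def target_replicas(current, p95_latency_ms, slo_ms=200,
                    min_replicas=1, max_replicas=10):
    """Pick a replica count from observed latency instead of a static rule.

    Scales the current replica count proportionally to how far the
    observed p95 latency sits from the latency SLO (all values invented).
    """
    ratio = p95_latency_ms / slo_ms
    desired = round(current * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(target_replicas(current=3, p95_latency_ms=320))  # -> 5
```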

    Logram: Efficient Log Parsing Using n-Gram Dictionaries

    Software systems usually record important runtime information in their logs. Logs help practitioners understand system runtime behaviors and diagnose field failures. As logs are usually very large in size, automated log analysis is needed to assist practitioners in their software operation and maintenance efforts. Typically, the first step of automated log analysis is log parsing, i.e., converting unstructured raw logs into structured data. However, log parsing is challenging, because logs are produced by static templates in the source code (i.e., logging statements) yet the templates are usually inaccessible when parsing logs. Prior work proposed automated log parsing approaches that have achieved high accuracy. However, as the volume of logs grows rapidly in the era of cloud computing, efficiency becomes a major concern in log parsing. In this work, we propose an automated log parsing approach, Logram, which leverages n-gram dictionaries to achieve efficient log parsing. We evaluated Logram on 16 public log datasets and compared Logram with five state-of-the-art log parsing approaches. We found that Logram achieves higher parsing accuracy than the best existing approaches (i.e., at least 10% higher, on average) and also outperforms these approaches in efficiency (i.e., 1.8 to 5.1 times faster than the second-fastest approach in terms of end-to-end parsing time). Furthermore, we deployed Logram on Spark and found that Logram scales out efficiently with the number of Spark nodes (e.g., with near-linear scalability for some logs) without sacrificing parsing accuracy. In addition, we demonstrated that Logram can support effective online parsing of logs, achieving parsing results and efficiency similar to the offline mode.
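
    A simplified sketch of the n-gram intuition (not the authors' implementation; the log lines and the frequency threshold are invented): tokens that appear only in rare n-grams are treated as dynamic parameters, while frequent n-grams form the static template:

```python
from collections import Counter

# Invented log lines; real deployments would stream far larger volumes.
logs = [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.2 port 22",
    "Connected to 10.0.0.3 port 8080",
]

# Build a bigram dictionary over all log lines.
bigrams = Counter()
for line in logs:
    tokens = line.split()
    bigrams.update(zip(tokens, tokens[1:]))

def parse(line, threshold=2):
    """Replace tokens that appear only in rare bigrams with a wildcard."""
    tokens = line.split()
    template = []
    for i, token in enumerate(tokens):
        neighbors = []
        if i > 0:
            neighbors.append((tokens[i - 1], token))
        if i < len(tokens) - 1:
            neighbors.append((token, tokens[i + 1]))
        # A token seen only in infrequent bigrams is likely a parameter.
        if all(bigrams[b] < threshold for b in neighbors):
            template.append("<*>")
        else:
            template.append(token)
    return " ".join(template)

for line in logs:
    print(parse(line))  # e.g., "Connected to <*> port 22"
```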