103 research outputs found

    Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

    Full text link
    Cloud computing systems fail in complex and unexpected ways due to unforeseen combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.
    Comment: IEEE Transactions on Dependable and Secure Computing; 16 pages. arXiv admin note: text overlap with arXiv:1908.1164
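
    A minimal sketch of the core idea (not the paper's actual pipeline): treat each experiment's execution trace as a bag of events, vectorize it, and cluster the vectors so that experiments exhibiting similar anomalous behaviour group into candidate failure modes. The event names, vectorization, and DBSCAN parameters below are illustrative assumptions.

```python
# Sketch: group fault-injection experiments into candidate failure modes
# by clustering their execution traces. Purely illustrative; event names,
# vectorization, and DBSCAN parameters are assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Each trace is the sequence of events observed during one injected experiment.
traces = [
    "create_server schedule_instance boot_vm attach_volume OK",
    "create_server schedule_instance boot_vm volume_timeout ERROR",
    "create_server schedule_instance boot_vm volume_timeout ERROR",
    "create_server rpc_timeout instance_stuck_in_build",
]

# Represent each trace as a TF-IDF vector over its events.
vectors = TfidfVectorizer(token_pattern=r"\S+").fit_transform(traces)

# Density-based clustering: each cluster is a candidate failure mode;
# noise points (label -1) are rare behaviours worth manual inspection.
labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(vectors)

for trace, label in zip(traces, labels):
    print(f"cluster {label:2d}  {trace}")
```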

    Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

    Get PDF
    Maintaining the health of IT infrastructure components for improved reliability and availability has been a research and innovation topic for many years. Identifying and handling failures is crucial and challenging due to the complexity of IT infrastructure, and system logs are the primary source of information for diagnosing and fixing failures. In this work, we address three essential research dimensions of failures: the need for failure handling in IT infrastructure, the contribution of system-generated logs to failure detection, and the reactive and proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of the existing literature along three prominent aspects: log preprocessing, anomaly and failure detection, and failure prevention. With this review, we (1) establish the need for IT infrastructure monitoring to avoid downtime, (2) examine the three families of approaches for anomaly and failure detection (rule-based, correlation-based, and classification-based), and (3) formulate recommendations and guidelines for further research. To the best of the authors' knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted through a meta-analysis and a comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in IT infrastructure failure detection and will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem.
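
    As a toy illustration of the gap between two of the approach families surveyed here, the snippet below contrasts a fixed keyword rule with a simple supervised classifier over log lines; the log lines and labels are invented for the example and do not come from the review.

```python
# Toy contrast between two of the approach families discussed above:
# a fixed keyword rule vs. a learned classifier over raw log lines.
# Log lines and labels are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

logs = [
    "disk /dev/sda1 read latency 3ms",            # normal
    "ERROR raid controller reset, retrying",      # failure
    "service nova-compute heartbeat ok",          # normal
    "kernel: I/O error, dev sda, sector 1234",    # failure
]
labels = [0, 1, 0, 1]

# Rule-based: flag anything containing a known failure keyword.
rule_hits = [int("error" in line.lower()) for line in logs]

# Classification-based: learn which tokens indicate failure.
X = CountVectorizer().fit_transform(logs)
clf = LogisticRegression().fit(X, labels)

print("rule-based :", rule_hits)
print("classifier :", clf.predict(X).tolist())
```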

    AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

    Full text link
    Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends, challenges, and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as incident detection, failure prediction, root cause analysis, and automated actions. We discuss the problem formulation for each task and then present a taxonomy of techniques to solve these problems. We also identify relatively underexplored topics, especially those that could significantly benefit from advances in the AI literature. Finally, we provide insights into the trends in this field and the key investment opportunities.
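
    One of the task formulations mentioned above, incident detection over operational metrics, can be sketched as plain anomaly scoring on a time series. The rolling z-score below is a generic baseline, not a technique claimed by the survey; the window, threshold, and latency values are arbitrary illustrative choices.

```python
# Generic baseline for the incident-detection formulation: score each point
# of an operational metric by its rolling z-score and alert on large values.
# Window size, threshold, and data are arbitrary illustrative choices.
import numpy as np

def rolling_zscore_alerts(values, window=10, threshold=3.0):
    values = np.asarray(values, dtype=float)
    alerts = []
    for i in range(window, len(values)):
        past = values[i - window:i]
        std = past.std() or 1e-9          # avoid division by zero on flat signals
        score = abs(values[i] - past.mean()) / std
        if score > threshold:
            alerts.append((i, values[i], round(score, 1)))
    return alerts

# Example: steady latency with one spike that should be flagged as an incident.
latency_ms = [20, 21, 19, 20, 22, 20, 21, 20, 19, 21, 20, 95, 21, 20]
print(rolling_zscore_alerts(latency_ms))
```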

    A Machine Learning Enhanced Scheme for Intelligent Network Management

    Get PDF
    Versatile networking services have a profound influence on daily life, while the amount and diversity of those services make network systems highly complex. Network scale and complexity grow with the increasing number of infrastructure devices, networking functions, network slices, and the evolution of the underlying architecture. The conventional approach to maintaining such a large and complex platform is manual administration, which makes effective and insightful management troublesome. A feasible and promising alternative is to extract insightful information from the large volumes of network data the platform produces. The goal of this thesis is to use learning-based algorithms from the machine learning community to discover valuable knowledge in this data and thereby support intelligent management and maintenance. The thesis focuses on two schemes: network anomaly detection with root cause localization, and control and optimization of critical traffic resources.

    First, network data carries informative messages, but its heterogeneity and complexity make diagnosis challenging. For unstructured logs, abstract, formatted log templates are extracted to regularize log records. An in-depth analysis framework based on heterogeneous data is proposed to detect faults and anomalies: it employs representation learning (word2vec-based embeddings for semantic expression) to map unstructured data into numerical features and fuses the extracted features for network anomaly and fault detection. Next, since fault and anomaly detection only reveals that an event has occurred without identifying its root cause, fault localization is used to narrow down the source of systemic anomalies. The extracted features are turned into anomaly degrees and coupled with an importance-ranking method to highlight the locations of anomalies in the network system; two ranking modes, instantiated by PageRank and by operation errors, jointly highlight the locations of latent issues.

    Beyond fault and anomaly detection, network traffic engineering manages communication and computation resources to optimize the efficiency of data transfer. When traffic is constrained by communication conditions, proactive path planning enables efficient traffic-control actions, so a learning-based traffic planning algorithm based on a sequence-to-sequence model is proposed to discover reasonable paths from abundant traffic history over a Software Defined Network architecture. Finally, traffic engineering based purely on empirical data is likely to yield stale and sub-optimal solutions, or even worse ones, so a resilient mechanism is required to adapt network flows to a dynamic environment based on context. A reinforcement learning-based scheme is therefore put forward for dynamic data forwarding that takes network resource status into account and shows a promising performance improvement.

    In summary, the proposed anomaly-processing framework strengthens analysis and diagnosis for network system administrators through combined fault detection and root cause localization, while the learning-based traffic engineering supports flow management from historical data and points to a promising direction for flexible traffic adjustment in ever-changing environments.
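
    A compressed sketch of two ingredients described above, word2vec-style embedding of log templates and a PageRank-based ranking to localize suspicious components: the templates, dependency graph, and anomaly degrees below are invented, and this is not the thesis's actual implementation.

```python
# Sketch of two ingredients from the abstract: embedding log templates with
# word2vec and ranking components with personalized PageRank. The templates,
# dependency graph, and anomaly degrees below are invented for illustration.
from gensim.models import Word2Vec
import networkx as nx

# 1) Learn token embeddings from tokenized log templates, then represent a
#    template as the mean of its token vectors.
templates = [
    ["connection", "timeout", "to", "database"],
    ["request", "served", "in", "ms"],
    ["connection", "refused", "by", "database"],
]
w2v = Word2Vec(templates, vector_size=16, min_count=1, epochs=50, seed=0)
template_vec = sum(w2v.wv[tok] for tok in templates[0]) / len(templates[0])
print("embedded template shape:", template_vec.shape)

# 2) Rank likely fault locations: personalize PageRank with per-component
#    anomaly degrees; edges are reversed so that rank flows from symptomatic
#    callers toward the backends they depend on (a common heuristic).
deps = nx.DiGraph([("api", "scheduler"), ("scheduler", "db"), ("api", "db")])
anomaly_degree = {"api": 0.1, "scheduler": 0.2, "db": 0.9}
ranking = nx.pagerank(deps.reverse(), personalization=anomaly_degree)
print(sorted(ranking.items(), key=lambda kv: -kv[1]))
```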

    Managing Smartphone Testbeds with SmartLab

    Get PDF
    The explosive number of smartphones with ever-growing sensing and computing capabilities has brought a paradigm shift to many traditional domains of computing. Re-programming smartphones and instrumenting them for application testing and data gathering at scale is currently a tedious and time-consuming process that poses significant logistical challenges. In this paper, we make three major contributions. First, we propose a comprehensive architecture, coined SmartLab, for managing a cluster of both real and virtual smartphones that are either wired to a private cloud or connected over a wireless link. Second, we propose and describe a number of Android management optimizations (e.g., command pipelining, screen-capturing, file management), which can be useful to the community for building similar functionality into their systems. Third, we conduct extensive experiments and microbenchmarks to support our design choices, providing qualitative evidence on the expected performance of each module comprising our architecture. This paper also reports on experiences of using SmartLab in a research-oriented setting and on ongoing and future development efforts.
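
    The kind of device-management plumbing such a testbed builds on can be illustrated with plain adb calls fanned out in parallel. This is a generic sketch, not SmartLab's code; it assumes adb is installed, on the PATH, and has devices attached.

```python
# Generic illustration of fanning an adb command out to many attached devices
# in parallel (one flavour of the command pipelining mentioned above).
# This is not SmartLab's code; it assumes `adb` is installed and on PATH.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def attached_devices():
    out = subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout
    # Skip the header line; keep only serials reported in the "device" state.
    return [line.split()[0] for line in out.splitlines()[1:]
            if line.strip().endswith("device")]

def run_on_device(serial, shell_cmd):
    cmd = ["adb", "-s", serial, "shell", shell_cmd]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    return serial, out

if __name__ == "__main__":
    serials = attached_devices()
    # Query the Android version of every device concurrently.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for serial, version in pool.map(
            lambda s: run_on_device(s, "getprop ro.build.version.release"), serials
        ):
            print(f"{serial}: Android {version}")
```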

    Advanced Elastic Platforms for High Throughput Computing on Container-based and Serverless Infrastructures

    Full text link
    The main objective of this thesis is to allow scientific users to deploy and execute highly-parallel, event-driven, file-processing serverless applications both in public (e.g. AWS) and in private (e.g. OpenNebula, OpenStack) cloud infrastructures. To achieve this objective, different tools and platforms have been developed and integrated to provide scientific users with a way to deploy High Throughput Computing applications based on containers that can benefit from the high elasticity capabilities of serverless environments. First, an open-source tool to deploy generic serverless workloads in the AWS public Cloud provider has been created. This tool allows scientific users to benefit from the features of AWS Lambda (e.g. high scalability, event-driven computing) for the deployment and integration of compute-intensive applications that use the Functions as a Service (FaaS) model. Second, an event-driven file-processing high-throughput programming model has been developed to allow users to deploy generic applications as workflows of functions in serverless architectures, offering transparent data management. Third, in order to overcome the drawbacks of public serverless services, such as limited execution time or computing capabilities, an open-source platform to support FaaS for compute-intensive applications in on-premises Clouds was created. The platform can be automatically deployed on multiple Clouds in order to create highly-parallel, event-driven, file-processing serverless applications. Finally, in order to assess and validate all the developed tools and platforms, several use cases with business and scientific backgrounds have been tested.
    Pérez González, AM. (2020). Advanced Elastic Platforms for High Throughput Computing on Container-based and Serverless Infrastructures [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/146365
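
    The event-driven file-processing pattern described in the abstract above can be sketched as a Lambda handler fired by S3 object-created events. The bucket name and the processing step are placeholders, and this is not the thesis's actual tooling, which wraps container-based workloads.

```python
# Sketch of the event-driven file-processing pattern on AWS Lambda:
# an S3 "object created" event triggers the function, which downloads the
# input file, processes it, and uploads the result to an output bucket.
# Bucket name and processing step are placeholders, not the thesis's tools.
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-output-bucket"  # placeholder

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Download the newly uploaded object.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Placeholder "processing": in the thesis this step would be a
        # containerized application invoked on the file.
        result = body.upper()

        # Store the output so that downstream functions/events can pick it up.
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=f"processed/{key}", Body=result)
        print(f"processed s3://{bucket}/{key} -> s3://{OUTPUT_BUCKET}/processed/{key}")
```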

    Logging Statements Analysis and Automation in Software Systems with Data Mining and Machine Learning Techniques

    Get PDF
    Log files are widely used to record runtime information of software systems, such as the timestamp of an event, the name or ID of the component that generated the log, and parts of the state of a task execution. The rich information in logs enables system developers (and operators) to monitor the runtime behavior of their systems and to track down system problems in development and production settings. With the ever-increasing scale and complexity of modern computing systems, the volume of logs is rapidly growing. For example, eBay reported that the rate of log generation on their servers was in the order of several petabytes per day in 2018 [17]. Therefore, the traditional way of log analysis, which largely relies on manual inspection (e.g., searching for error/warning keywords or grep), has become an inefficient, labor-intensive, error-prone, and outdated task. The growth of logs has spurred the emergence of automated tools and approaches for log mining and analysis. In parallel, the embedding of logging statements in the source code is a manual and error-prone task, and developers might forget to add a logging statement in the software's source code. To address the logging challenge, many efforts have aimed to automate logging statements in the source code, and, in addition, many tools have been proposed to perform large-scale log file analysis using machine learning and data mining techniques. However, the current logging process is still mostly manual, and thus the proper placement and content of logging statements remain open challenges. To overcome these challenges, methods that aim to automate log placement and content prediction, i.e., `where and what to log', are of high interest. In addition, approaches that can automatically mine and extract insight from large-scale logs are also well sought after. Thus, in this research, we focus on predicting log statements, and for this purpose, we perform an experimental study on open-source Java projects. We introduce a log-aware code-clone detection method to predict the location and description of logging statements. Additionally, we incorporate natural language processing (NLP) and deep learning methods to further enhance the performance of the log statements' description prediction. We also introduce deep learning based approaches for automated analysis of software logs. In particular, we analyze execution logs and extract natural language characteristics of logs to enable the application of natural language models for automated log file analysis. Then, we propose automated tools for analyzing log files and measuring the information gain from logs for different log analysis tasks such as anomaly detection. We then continue our NLP-enabled approach by leveraging state-of-the-art language models, i.e., Transformers, to perform automated log parsing.
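
    A minimal illustration of the log-parsing step that such NLP-based analyses depend on: masking variable fields so that raw lines collapse into a small set of templates. The regexes and example log lines are assumptions for illustration, not the parsing method proposed in the thesis.

```python
# Minimal illustration of log parsing as used upstream of NLP-based analyses:
# mask variable fields (IPs, hex ids, numbers) so that raw log lines collapse
# into a small set of templates. Regexes and log lines are illustrative only.
import re
from collections import Counter

MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),   # IPv4 addresses first
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(line: str) -> str:
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

logs = [
    "Accepted connection from 10.0.0.5 port 51234",
    "Accepted connection from 10.0.0.9 port 60111",
    "Worker 7 finished task 12345 in 98 ms",
    "Worker 3 finished task 99021 in 120 ms",
]

templates = Counter(to_template(line) for line in logs)
for template, count in templates.items():
    print(f"{count}x  {template}")
```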

    Data Analytics as a Service: A look inside the PANACEA project

    Get PDF