
    Mining event logs with SLCT and LogHound


    Understanding error log event sequence for failure analysis

    As large-scale parallel systems evolve, they are increasingly employed for mission-critical applications, so anticipating and accommodating failures is crucial to their design. Failure is a commonplace feature of these large-scale systems and cannot be treated as an exception. The system state is mostly captured through logs, so a proper understanding of these error logs is essential for failure analysis: the logs contain the "health" information of the system. In this paper we design an approach that seeks to find similarities in the patterns of log events that lead to failures. Our experiments show that several root causes of soft lockup failures can be traced through the logs. We capture the behavior of failure-inducing patterns and find that failure and non-failure log patterns are dissimilar.
    Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarity
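    The central idea here, comparing patterns of log events that precede failures against normal ones, can be pictured with a small sketch. The snippet below is not the authors' method: it assumes a naive event-type extraction (first token of each line), a fixed window size, and made-up log lines, and simply contrasts within-failure versus within-normal Jaccard similarity.

```python
# Minimal sketch, NOT the paper's implementation: reduce windows of log lines to
# sets of event "types" and compare failure-related windows with normal ones via
# Jaccard similarity. Event-type extraction, window size and the toy lines below
# are all assumptions.
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets of event types (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def window_event_sets(lines, window_size=2):
    """Split consecutive log lines into windows; keep each window's set of
    event types, naively taken as the first whitespace token of the line."""
    return [
        {ln.split()[0] for ln in lines[i:i + window_size] if ln.strip()}
        for i in range(0, len(lines), window_size)
    ]

def mean_pairwise_similarity(windows):
    """Average Jaccard similarity over all pairs of windows."""
    pairs = list(combinations(windows, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

# Toy data only: lines loosely imitating soft-lockup vs. routine messages.
failure_logs = [
    "kernel: BUG: soft lockup - CPU#3 stuck for 22s",
    "kernel: watchdog detected hard LOCKUP on cpu 3",
    "kernel: BUG: soft lockup - CPU#7 stuck for 23s",
    "kernel: watchdog detected hard LOCKUP on cpu 7",
]
normal_logs = [
    "sshd: accepted connection from 10.0.0.5",
    "cron: job completed successfully",
    "kernel: eth0 link is up",
    "systemd: started daily cleanup",
]
print("within-failure similarity:", mean_pairwise_similarity(window_event_sets(failure_logs)))
print("within-normal  similarity:", mean_pairwise_similarity(window_event_sets(normal_logs)))
```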

    Technical Report: Anomaly Detection for a Critical Industrial System using Context, Logs and Metrics

    Recent advances in contextual anomaly detection attempt to combine resource metrics and event logs to uncover unexpected system behaviors and malfunctions at runtime. These techniques are highly relevant for critical software systems, where monitoring is often mandated by international standards and guidelines. In this technical report, we analyze the effectiveness of a metrics-logs contextual anomaly detection technique in a middleware for Air Traffic Control systems. Our study addresses the challenges of applying such techniques to a new case study with a dense volume of logs and a finer monitoring sampling rate. We propose an automated abstraction approach to infer system activities from dense logs and use regression analysis to infer the anomaly detector. We observed that detection accuracy is impacted by abrupt changes in resource metrics or when anomalies are asymptomatic in both resource metrics and event logs. Guided by our experimental results, we propose and evaluate several actionable improvements, including a change detection algorithm and the use of time windows in contextual anomaly detection. This technical report accompanies the paper "Contextual Anomaly Detection for a Critical Industrial System based on Logs and Metrics" [1] and provides further details on the analysis method, case study and experimental results.
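    To illustrate the general shape of a metrics-logs contextual detector of this kind (a sketch only, not the report's actual technique), one can regress a resource metric on per-window log event counts and flag windows with large residuals. The feature layout, the threshold rule and the synthetic data below are assumptions.

```python
# Hedged sketch of contextual (metrics vs. logs) anomaly detection: fit a
# regression that predicts a resource metric from per-window log event counts,
# then flag windows whose residual exceeds a threshold. This is an assumed
# setup, not the paper's exact method.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_context_model(event_counts: np.ndarray, metric: np.ndarray) -> LinearRegression:
    """event_counts: shape (n_windows, n_event_types); metric: shape (n_windows,)."""
    return LinearRegression().fit(event_counts, metric)

def detect_anomalies(model, event_counts, metric, k: float = 3.0) -> np.ndarray:
    """Flag windows where |observed - predicted| exceeds k standard deviations
    of the residuals (a simple, assumed thresholding rule)."""
    residuals = metric - model.predict(event_counts)
    return np.abs(residuals) > k * residuals.std()

# Toy usage with synthetic data: 200 time windows, 5 event types.
rng = np.random.default_rng(0)
X = rng.poisson(4, size=(200, 5)).astype(float)
y = X @ np.array([0.5, 1.0, 0.2, 0.0, 0.8]) + rng.normal(0, 1, 200)
y[150] += 25.0                      # inject one anomalous window
model = fit_context_model(X[:100], y[:100])
print(np.where(detect_anomalies(model, X, y))[0])  # likely reports window 150
```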

    CloudLens, a scripting language for analyzing semi-structured data

    When developing applications that run in the cloud, programmers often have no choice but to trace the execution of their code with ever more print statements. This practice generates enormous quantities of log files that must then be analyzed to find the causes of a bug. Moreover, applications often combine many microservices, and programmers have only partial control over the format of the data they manipulate. This article presents CloudLens, a language dedicated to the analysis of this kind of so-called semi-structured data. It is based on a dataflow programming model that makes it possible both to analyze the sources of an error and to monitor a running application.
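    As a rough illustration of the dataflow model such a language relies on, the sketch below chains small stages (parse, filter, report) over a stream of semi-structured records in plain Python. It is not CloudLens syntax; the log format, regular expression and field names are invented for the example.

```python
# Generic illustration of a dataflow-style log pipeline, NOT CloudLens syntax:
# records stream through small, composable stages. The regex and field names
# describe a hypothetical log format.
import re
from typing import Iterable, Iterator

LOG_RE = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)")

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Turn raw lines into semi-structured records, skipping unparsable ones."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            yield m.groupdict()

def only_errors(records: Iterable[dict]) -> Iterator[dict]:
    """Keep only ERROR-level records."""
    return (r for r in records if r["level"] == "ERROR")

def report(records: Iterable[dict]) -> None:
    """Terminal stage: print what flowed through."""
    for r in records:
        print(f"{r['ts']}: {r['msg']}")

lines = [
    "2024-01-01T00:00:01 INFO service started",
    "2024-01-01T00:00:02 ERROR connection refused by backend",
]
report(only_errors(parse(lines)))
```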

    Events Classification in Log Audit


    Applying Machine Learning to Root Cause Analysis in Agile CI/CD Software Testing Environments

    This thesis evaluates machine learning classification and clustering algorithms with the aim of automating the root cause analysis of failed tests in agile software testing environments. The inefficiency of manually categorizing root causes, in terms of both time and human resources, motivates this work. The development and testing environments of an agile team at Ericsson Finland are used as the work's framework. The author extracts relevant features from the raw log data after interviewing the team's testing engineers (human experts). Initial efforts go into clustering the unlabeled data; despite qualitative correlations between several clusters and failure root causes, the vagueness of the remaining clusters leads to the consideration of labeling. A new round of interviews with the testing engineers then leads to the conceptualization of ground-truth categories for the test failures, with which the human experts label the dataset. A collection of artificial neural networks that either classify the data or pre-process it for clustering is then optimized. The best solution is a classification multilayer perceptron that correctly assigns the failure category to new examples 88.9% of the time on average. The primary outcome of this thesis is a methodology for extracting expert knowledge and adapting it to machine learning techniques for test failure root cause analysis using test log data. The proposed methodology constitutes a prototype or baseline approach towards achieving this objective in a corporate environment.
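    The classification step can be pictured with a short sketch: represent each failure's log text as a bag of words and train a multilayer perceptron on labelled root-cause categories. This is an assumed pipeline, not the thesis implementation; the category names, toy logs and network size are placeholders.

```python
# Rough sketch (assumed pipeline, not the thesis implementation): bag-of-words
# features from test failure logs feeding a multilayer perceptron classifier.
# The labels and log texts below are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

train_logs = [
    "timeout waiting for node restart",
    "assertion failed expected 200 got 500",
    "timeout waiting for response from peer",
    "environment setup failed missing package",
]
train_labels = ["environment", "product_bug", "environment", "environment"]

model = make_pipeline(
    CountVectorizer(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
model.fit(train_logs, train_labels)
print(model.predict(["timeout waiting for node to come back"]))
```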

    Supervised fault detection using unstructured server-log data to support root cause analysis

    Fault detection is one of the most important aspects of telecommunication networks. Given the growing scale and complexity of communication networks, maintenance and debugging have become extremely complicated and expensive. In complex systems, a higher rate of failure, due to the large number of components, has increased the importance of both fault detection and root cause analysis. Fault detection for communication networks is based on analyzing system logs from servers or different components in a network in order to determine whether there is any unusual activity. However, detecting and diagnosing problems in such huge systems are challenging tasks for humans, since the amount of information that needs to be processed goes far beyond the level that can be handled manually. There is therefore an immense demand for automatic processing of datasets to extract the relevant data needed for detecting anomalies. In a Big Data world, using machine learning techniques to analyze log data automatically has become increasingly popular. Machine-learning-based fault detection requires no prior knowledge about the types of problems and does not rely on explicit programming (such as rule-based systems); it can improve its performance automatically by learning from experience. In this thesis, we investigate supervised machine learning approaches to detect known faults from unstructured log data as a fast and efficient approach. As the aim is to identify abnormal cases against normal ones, anomaly detection is treated as a binary classification problem. For extracting numerical features from event logs, as a primary step in any classification, we use windowing along with bag-of-words approaches, given the textual characteristics of the data (high dimensionality and sparseness). We focus on linear classification methods such as the single-layer perceptron and Support Vector Machines as promising candidates for supervised fault detection based on the textual characteristics of network-based server-log data. In order to develop an approach that generalizes well for detecting known faults, two important factors are investigated, namely the size of the datasets and the time duration of faults. Based on the experimental results concerning these two factors, a two-layer classification is proposed to overcome the windowing and feature-extraction challenges posed by long-lasting faults. The thesis proposes a novel approach for collecting feature vectors for the two layers of this classification: the first layer attempts to detect the starting line of each fault repetition as well as the fault duration, and the models obtained from the first layer are used to create feature vectors for the second layer. To evaluate the learning algorithms and select the best detection model, cross-validation and F-scores are used, because traditional metrics such as accuracy and error rates are not well suited to imbalanced datasets. The experimental results show that the proposed SVM classifier provides the best performance independent of fault duration, while factors such as the labelling rule and reduction of the feature space have no significant effect on performance. In addition, the results show that the two-layer classification system can improve the performance of fault detection; however, a more suitable approach for collecting feature vectors with a smaller time span needs to be investigated further.
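    A minimal sketch of a windowed bag-of-words setup of this kind, evaluated with F1 under cross-validation, is given below. It is not the thesis pipeline; the window size and step, the labelling of windows and the toy corpus are all assumptions.

```python
# Hedged sketch of a first-layer setup along the lines described above: slide a
# window over log lines, build bag-of-words features per window, and evaluate a
# linear SVM with F1 under cross-validation (better suited than accuracy for
# imbalanced data). Window size, step, labelling and the toy corpus are assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def windows(lines, size=20, step=10):
    """Yield overlapping windows of log lines joined into single text blobs."""
    for start in range(0, max(len(lines) - size + 1, 1), step):
        yield " ".join(lines[start:start + size])

# Toy corpus: 'fault' windows contain an error keyword, normal ones do not.
normal_w = list(windows(["heartbeat ok from node %d" % i for i in range(200)]))
fault_w = list(windows(["link failure detected on port %d" % i for i in range(100)]))
texts = normal_w + fault_w
labels = np.array([0] * len(normal_w) + [1] * len(fault_w))

X = CountVectorizer().fit_transform(texts)              # sparse bag-of-words features
scores = cross_val_score(LinearSVC(), X, labels, cv=5, scoring="f1")
print("mean F1 across folds:", scores.mean())
```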