3,748 research outputs found

    Understanding error log event sequence for failure analysis

    As large-scale parallel systems have evolved, they are increasingly employed for mission-critical applications, so anticipating and accommodating failures is crucial to their design. Failure is a commonplace feature of these systems and cannot be treated as an exception. System state is captured largely through logs, and a proper understanding of these error logs is essential for failure analysis because they carry the “health” information of the system. In this paper we design an approach that finds similarities in the patterns of log events that lead to failures. Our experiments show that several root causes of soft lockup failures can be traced through the logs. We characterize the behavior of failure-inducing patterns and find that failure and non-failure log patterns are dissimilar.
    Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarity
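    The core idea, scoring how closely a new log event sequence resembles previously observed failure-inducing sequences, can be illustrated with a minimal sketch. The event IDs below are hypothetical, and difflib's sequence ratio is only a stand-in for whatever similarity measure the paper actually uses.

```python
from difflib import SequenceMatcher

def sequence_similarity(seq_a, seq_b):
    """Normalized similarity between two event-ID sequences (0 = disjoint, 1 = identical)."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

# Hypothetical event-ID sequences extracted from node error logs.
failure_pattern = ["mce_error", "ecc_warning", "soft_lockup", "kernel_panic"]
candidate       = ["ecc_warning", "soft_lockup", "kernel_panic"]
benign          = ["nfs_timeout", "nfs_retry", "nfs_recovered"]

print(sequence_similarity(failure_pattern, candidate))  # high: resembles a failure sequence
print(sequence_similarity(failure_pattern, benign))     # low: resembles a non-failure sequence
```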

    Next stop 'NoOps': enabling cross-system diagnostics through graph-based composition of logs and metrics

    Performing diagnostics in IT systems is an increasingly complicated task, and it cannot be completed in satisfactory time even by the most skillful operators. Systems and their architecture change very rapidly in response to business and user demand. Many organizations see value in the maintenance and management model of NoOps, which stands for No Operations. One implementation of this model is a system that is maintained automatically, without any human intervention. The path to NoOps involves not only precise and fast diagnostics but also reusing as much knowledge as possible after the system is reconfigured or changed. The biggest challenge is to leverage knowledge about one IT system and reuse it for diagnostics of another, different system. We propose a framework of weighted graphs which can transfer knowledge and perform high-quality diagnostics of IT systems. We encode all possible data in a graph representation of a system state and automatically calculate the weights of these graphs. Then, by evaluating the similarity between graphs, we transfer knowledge about failures from one system to another and use it for diagnostics. We successfully evaluate the proposed approach on Spark, Hadoop, Kafka and Cassandra systems.
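    A minimal sketch of the underlying idea: encode a system state as a weighted graph and score its similarity to a failure graph learned on a different system. The components, weights and similarity function below are assumptions for illustration, not the paper's actual encoding or weighting scheme.

```python
import networkx as nx

def state_graph(weighted_edges):
    """Build a weighted graph of a system state; weights might encode, e.g.,
    error-log volume or metric anomaly scores on each component-to-component link."""
    g = nx.Graph()
    g.add_weighted_edges_from(weighted_edges)
    return g

def edge_set(g):
    return {tuple(sorted(edge)) for edge in g.edges()}

def graph_similarity(g1, g2):
    """Crude similarity: shared edges, discounted by how far apart their weights are."""
    common = edge_set(g1) & edge_set(g2)
    score = sum(1.0 - abs(g1[u][v]["weight"] - g2[u][v]["weight"])
                / max(g1[u][v]["weight"], g2[u][v]["weight"])
                for u, v in common)
    return score / max(g1.number_of_edges(), g2.number_of_edges(), 1)

# Hypothetical failure signature observed on a Spark cluster ...
spark_failure = state_graph([("driver", "executor-1", 0.9), ("executor-1", "storage", 0.7)])
# ... compared against the current state of a different system sharing some components.
other_state = state_graph([("scheduler", "worker-1", 0.8), ("executor-1", "storage", 0.6)])
print(graph_similarity(spark_failure, other_state))
```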

    Identifying common problems in the acquisition and deployment of large-scale software projects in the US and UK healthcare systems

    Public and private organizations are investing increasing amounts into the development of healthcare information technology. These applications are perceived to offer numerous benefits. Software systems can improve the exchange of information between healthcare facilities. They support standardised procedures that can help to increase consistency between different service providers. Electronic patient records ensure minimum standards across the trajectory of care when patients move between different specializations. Healthcare information systems also offer economic benefits through efficiency savings, for example by providing the data that helps to identify potential bottlenecks in the provision and administration of care. However, a number of high-profile failures reveal the problems that arise when staff must cope with the loss of these applications. In particular, teams have to retrieve paper-based records that often lack the detail held in electronic systems. Individuals who have only ever used electronic information systems face particular problems in learning how to apply paper-based fallbacks. The following pages compare two different failures of healthcare information systems in the UK and North America. The intention is to ensure that future initiatives to extend the integration of electronic patient records will build on the ‘lessons learned’ from previous systems.

    Identifying recovery patterns from resource usage data of cluster systems

    Failures of cluster systems have adverse effects and can be costly. System administrators have employed a divide-and-conquer approach to diagnosing the root cause of such failures in order to take corrective or preventive measures. In most cases, event logs are the source of information about the failures. Events that characterize failures are noted and categorized as causes of failure. However, not all ‘causative’ events lead to eventual failure, as some fault sequences recover. Such recovery sequences pose a challenge to system administrators and failure prediction tools because they inflate false positives: their presence is predicted as “failure causing” when in reality it is not. To distinguish such recovery patterns from failure patterns, we propose a novel approach that uses resource usage data of cluster systems to identify recovery and failure sequences, and we further propose an online detection approach to the same problem. We evaluate our approach on data from the Ranger supercomputer, and the results are positive.
    Keywords: Change point detection; resource usage data; recovery sequence; detection; large-scale HPC system
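    As an illustration of how resource usage data can expose both the onset of a fault and a subsequent recovery, here is a minimal change-point sketch; the metric, window and threshold are hypothetical and far simpler than what the paper or an online detector would use.

```python
import statistics

def detect_change_points(series, window=5, threshold=2.0):
    """Flag indices where the mean of the next `window` samples shifts by more than
    `threshold` standard deviations of the preceding window (a crude change-point test)."""
    change_points = []
    for i in range(window, len(series) - window):
        before, after = series[i - window:i], series[i:i + window]
        mu, sigma = statistics.mean(before), statistics.pstdev(before) or 1e-9
        if abs(statistics.mean(after) - mu) / sigma > threshold:
            change_points.append(i)
    return change_points

# Hypothetical per-node memory usage (%): a fault drives it up, then the node recovers.
memory_usage = [40, 41, 42, 40, 41, 80, 85, 88, 90, 92, 91, 45, 42, 41, 40, 41]
print(detect_change_points(memory_usage))  # flags both the upward shift and the recovery
```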

    DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services

    Large-scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs), tied to these KPIs, baked into their customer agreements. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. It is critical to minimize the time and manual effort spent diagnosing and triaging such issues in order to reduce customer impact. The large volume of logs and the mix of attribute types (categorical, continuous) in the logs make diagnosing regressions non-trivial. In this paper, we present the design, implementation and experience from building and deploying DeCaf, a system for automated diagnosis and triaging of KPI issues using service logs. It uses machine learning along with pattern mining to help service owners automatically root-cause and triage performance issues. We present the learnings and results from case studies on two large-scale cloud services in Microsoft, where DeCaf successfully diagnosed 10 known and 31 unknown issues. DeCaf also automatically triages the identified issues by leveraging historical data. Our key insights are that for any such diagnosis tool to be effective in practice, it should (a) scale to large volumes of service logs and attributes, (b) support different types of KPIs and ranking functions, and (c) be integrated into the DevOps processes.
    Comment: To be published in the proceedings of ICSE-SEIP '20, Seoul, Republic of Korea.
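    To make the predicate-mining idea concrete, the hedged sketch below ranks (attribute, value) predicates from service logs by how much each raises the KPI-violation rate. The column names, data and scoring function are hypothetical; this is not DeCaf's actual algorithm.

```python
import pandas as pd

# Hypothetical service-log sample: each row has request attributes and a latency KPI.
logs = pd.DataFrame({
    "region":     ["us-east", "us-east", "eu-west", "eu-west", "us-east", "eu-west"],
    "api":        ["read", "write", "read", "write", "write", "write"],
    "latency_ms": [20, 25, 22, 480, 30, 510],
})

slo_ms = 100  # hypothetical SLO threshold on the latency KPI

def rank_predicates(df, kpi, threshold):
    """Rank (attribute, value) predicates by how much they raise the KPI-violation rate."""
    base_rate = (df[kpi] > threshold).mean()
    rows = []
    for col in df.columns:
        if col == kpi:
            continue
        for value, group in df.groupby(col):
            rate = (group[kpi] > threshold).mean()
            rows.append((col, value, rate - base_rate, len(group)))
    return sorted(rows, key=lambda r: r[2], reverse=True)

for col, value, lift, support in rank_predicates(logs, "latency_ms", slo_ms):
    print(f"{col}={value}: violation-rate lift {lift:+.2f} over {support} rows")
```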

    Fault diagnosis for IP-based network with real-time conditions

    BACKGROUND: Fault diagnosis techniques have been based on many paradigms, which derive from diverse areas and serve different purposes: obtaining a representation model of the network for fault localization, selecting optimal probe sets for monitoring network devices, reducing fault detection time, and detecting faulty components in the network. Although there are several solutions for diagnosing network faults, challenges remain: a fault diagnosis solution needs to be always available and able to process data in a timely manner, because stale results inhibit the quality and speed of informed decision-making. Also, there is no non-invasive technique to continuously diagnose network symptoms without leaving the system vulnerable to failures, nor a technique resilient to the network's dynamic changes, which can cause new failures with different symptoms.
    AIMS: This thesis aims to propose a model for the continuous and timely diagnosis of IP-based network faults, independent of the network structure and based on data analytics techniques.
    METHODS: This research's point of departure was the hypothesis of a fault propagation phenomenon that allows the observation of failure symptoms at a higher network level than the fault origin. For the model's construction, monitoring data were collected from an extensive campus network in which impactful link failures were induced at different instants of time and with different durations. These data correspond to parameters widely used in the actual management of a network. The collected data allowed us to understand the faults' behavior and how they manifest at a peripheral level. Based on this understanding and a data analytics process, the first three modules of our model, named PALADIN, were proposed (Identify, Collection and Structuring); they define the peripheral data collection and the pre-processing necessary to obtain a description of the network's state at a given moment. These modules give the model the ability to structure the data considering the delays of the multiple responses that the network delivers to a single monitoring probe and the multiple network interfaces that a peripheral device may have. The result is a structured data stream ready to be analyzed. For this analysis, it was necessary to implement an incremental learning framework that respects the dynamic nature of networks. It comprises three elements: an incremental learning algorithm, a data rebalancing strategy, and a concept drift detector. This framework is the fourth module of the PALADIN model, named Diagnosis. To evaluate the PALADIN model, the Diagnosis module was implemented with 25 different incremental algorithms, ADWIN as the concept-drift detector and SMOTE (adapted to the streaming scenario) as the rebalancing strategy. In addition, a dataset (the SOFI dataset) was built through the first modules of the PALADIN model; these data form the incoming data stream used to evaluate the Diagnosis module's performance. The PALADIN Diagnosis module performs online classification of network failures, so it is a learning model that must be evaluated in a stream context. Prequential evaluation is the most widely used method for this task, so we adopt it to evaluate the model's performance over time through several stream evaluation metrics.
    RESULTS: This research first evidences the phenomenon of impact fault propagation, making it possible to detect fault symptoms at the peripheral level of a monitored network, which translates into non-invasive monitoring of the network. Second, the PALADIN model is the major contribution in the fault detection context because it covers two aspects: an online learning model that continuously processes the network symptoms and detects internal failures, and the concept-drift detection and data-stream rebalancing components that make resilience to dynamic network changes possible. Third, it is well known that the number of available real-world datasets for imbalanced stream classification is still too small, and that number is further reduced in the networking context. The SOFI dataset obtained with the first modules of the PALADIN model adds to that number and encourages work on unbalanced data streams and on network fault diagnosis.
    CONCLUSIONS: The proposed model contains the necessary elements for the continuous and timely diagnosis of IP-based network faults; it introduces the idea of periodic monitoring of peripheral network elements and uses data analytics techniques to process the collected data. Based on the analysis, processing, and classification of peripherally collected data, it can be concluded that PALADIN achieves its objective. The results indicate that peripheral monitoring allows faults in the internal network to be diagnosed, and that the diagnosis process needs an incremental learning process, concept-drift detection, and a rebalancing strategy. The experiments showed that PALADIN makes it possible to learn from the network's manifestations and diagnose internal network failures; this was verified with 25 different incremental algorithms, ADWIN as the concept-drift detector and SMOTE (adapted to the streaming scenario) as the rebalancing strategy. This research clearly illustrates that it is unnecessary to monitor all the internal network elements to detect a network's failures; instead, it is enough to choose the peripheral elements to be monitored. Furthermore, with proper processing of the collected status and traffic descriptors, it is possible to learn from the arriving data using incremental learning in cooperation with data rebalancing and concept drift approaches. This proposal continuously diagnoses the network's symptoms without leaving the system vulnerable to failures while remaining resilient to the network's dynamic changes.
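    As a rough illustration of prequential (test-then-train) evaluation of an incremental classifier on an imbalanced, drifting stream, here is a minimal sketch. It uses scikit-learn's SGDClassifier as a stand-in for the thesis's 25 incremental algorithms, does not reproduce ADWIN or the streaming SMOTE adaptation, and the peripheral features (delay, loss) and the synthetic stream are assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)

def peripheral_stream(n=2000):
    """Hypothetical stream of peripheral measurements (delay_ms, loss_rate) with a fault
    label; halfway through, the fault signature shifts to simulate concept drift."""
    for i in range(n):
        fault = rng.random() < 0.1                       # imbalanced: ~10% fault samples
        drifted = i > n // 2
        delay = rng.normal(120 if fault else 20, 5)
        loss = rng.normal((0.3 if drifted else 0.1) if fault else 0.01, 0.02)
        yield np.array([[delay, loss]]), int(fault)

clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])
correct, seen = 0, 0

# Prequential loop: each sample is first used for testing, then for incremental training.
for x, y in peripheral_stream():
    if seen > 0:
        correct += int(clf.predict(x)[0] == y)
    clf.partial_fit(x, [y], classes=classes)
    seen += 1
    if seen % 500 == 0:
        print(f"after {seen} samples, prequential accuracy = {correct / (seen - 1):.3f}")
```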

    Fault Diagnosis in Enterprise Software Systems Using Discrete Monitoring Data

    Success for many businesses depends on their information software systems. Keeping these systems operational is critical, as failures in these systems are costly. Such systems are in many cases sophisticated, distributed and dynamically composed. To ensure high availability and correct operation, it is essential that failures be detected promptly, their causes diagnosed and remedial actions taken. Although automated recovery approaches exist for specific problem domains, the problem-resolution process is in many cases manual and painstaking. Computer support personnel put a great deal of effort into resolving reported failures. The growing size and complexity of these systems creates the need to automate this process. The primary focus of our research is on automated fault diagnosis and recovery using discrete monitoring data such as log files and notifications. Our goal is to quickly pinpoint the root cause of a failure. Our contributions are: modelling discrete monitoring data for automated analysis; automatically leveraging common symptoms of failures from historic monitoring data, using such models to pinpoint faults; and providing a model for decision-making under uncertainty so that appropriate recovery actions are chosen. Failures in such systems are caused by software defects, human error, hardware failures, environmental conditions and malicious behaviour. Our primary focus in this thesis is on software defects and misconfiguration.
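    A minimal sketch of the overall idea: match the currently observed symptoms against historical failure signatures, then choose a recovery action by expected utility. The signatures, actions, probabilities and utility model below are hypothetical and only stand in for the models developed in the thesis.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical historical failure signatures: fault -> typical symptom set.
signatures = {
    "db_connection_pool_exhausted": {"ERR_TIMEOUT", "WARN_POOL_FULL", "ERR_503"},
    "disk_full":                    {"ERR_IO", "WARN_DISK_90", "ERR_WRITE_FAIL"},
}

# Hypothetical recovery actions: action -> (cost, per-fault probability of fixing it).
actions = {
    "restart_app_server": (1.0, {"db_connection_pool_exhausted": 0.8, "disk_full": 0.1}),
    "clean_temp_storage": (0.5, {"db_connection_pool_exhausted": 0.1, "disk_full": 0.9}),
}

def diagnose_and_act(observed, fix_value=10.0):
    # Turn signature similarities into a crude fault belief distribution.
    scores = {fault: jaccard(observed, symptoms) for fault, symptoms in signatures.items()}
    total = sum(scores.values()) or 1.0
    belief = {fault: s / total for fault, s in scores.items()}

    # Pick the action with the highest expected utility (expected fix value minus cost).
    def expected_utility(action):
        cost, fix_prob = actions[action]
        return sum(belief[f] * fix_prob.get(f, 0.0) for f in belief) * fix_value - cost

    return belief, max(actions, key=expected_utility)

print(diagnose_and_act({"ERR_TIMEOUT", "ERR_503", "WARN_GC_PAUSE"}))
```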
