
    Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

    Maintaining the health of IT infrastructure components for improved reliability and availability has been a research and innovation topic for many years. Identification and handling of failures are crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information to diagnose and fix failures. In this work, we address three essential research dimensions about failures: the need for failure handling in IT infrastructure, the contribution of system-generated logs to failure detection, and the reactive and proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of existing literature by considering three prominent aspects: log preprocessing, anomaly and failure detection, and failure prevention. With this coherent review, we (1) establish the need for IT infrastructure monitoring to avoid downtime, (2) examine three types of approaches for anomaly and failure detection, namely rule-based, correlation, and classification methods, and (3) formulate recommendations for researchers on directions for further research. To the best of the authors' knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of meta-analysis and a comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection. It will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem.

    Reliable High Performance Peta- and Exa-Scale Computing

    As supercomputers become larger and more powerful, they are growing increasingly complex. This is reflected both in the exponentially increasing numbers of components in HPC systems (LLNL is currently installing the 1.6 million core Sequoia system) and in the wide variety of software and hardware components that a typical system includes. At this scale it becomes infeasible to make each component sufficiently reliable to prevent regular faults somewhere in the system or to account for all possible cross-component interactions. The resulting faults and instability cause HPC applications to crash, perform sub-optimally or even produce erroneous results. As supercomputers continue to approach Exascale performance and full system reliability becomes prohibitively expensive, we will require novel techniques to bridge the gap between the lower reliability provided by hardware systems and users' unchanging need for consistent performance and reliable results. Previous research on HPC system reliability has developed various techniques for tolerating and detecting various types of faults. However, these techniques have seen very limited real applicability because of our poor understanding of how real systems are affected by complex faults such as soft fault-induced bit flips or performance degradations. Prior work on such techniques has had very limited practical utility because it has generally focused on analyzing the behavior of entire software/hardware systems both during normal operation and in the face of faults. Because such behaviors are extremely complex, such studies have only produced coarse behavioral models of limited sets of software/hardware system stacks. Since this provides little insight into the many different system stacks and applications used in practice, this work has had little real-world impact.
    My project addresses this problem by developing a modular methodology to analyze the behavior of applications and systems during both normal and faulty operation. By synthesizing models of individual components into whole-system behavior models, my work is making it possible to automatically understand the behavior of arbitrary real-world systems to enable them to tolerate a wide range of system faults. My project is following a multi-pronged research strategy. Section II discusses my work on modeling the behavior of existing applications and systems. Section II.A discusses resilience in the face of soft faults and Section II.B looks at techniques to tolerate performance faults. Finally, Section III presents an alternative approach that studies how a system should be designed from the ground up to make resilience natural and easy.

    Real-time spatio-temporal coherence estimation for autonomous mode identification and invariance tracking

    A general method of anomaly detection from time-correlated sensor data is disclosed. Multiple time-correlated signals are received. Their cross-signal behavior is compared against a fixed library of invariants. The library is constructed during a training process, which is itself data-driven using the same time-correlated signals. The method is applicable to a broad class of problems and is designed to respond to any departure from normal operation, including faults or events that lie outside the training envelope.
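
The idea in this abstract, comparing cross-signal behavior against a trained library of invariants, can be illustrated with a minimal sketch. The abstract does not say how coherence is measured or how the library is stored, so everything below is an assumption made for illustration: pairwise Pearson correlation stands in for the coherence measure, the library is a plain list of correlation vectors, and the function names and the `tol` threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coherence(window):
    """Vector of pairwise correlations for a window given as a list of signals."""
    k = len(window)
    return [pearson(window[i], window[j]) for i in range(k) for j in range(i + 1, k)]


def distance(a, b):
    """Largest absolute difference between two coherence vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


def train_library(windows, tol=0.2):
    """Keep one coherence vector per distinct operating mode seen in training."""
    library = []
    for w in windows:
        c = coherence(w)
        if all(distance(c, ref) > tol for ref in library):
            library.append(c)
    return library


def is_anomalous(window, library, tol=0.2):
    """A window is anomalous when it matches no invariant in the library."""
    c = coherence(window)
    return all(distance(c, ref) > tol for ref in library)


def make_window(start, length=50):
    """Hypothetical nominal data: three sensors driven by one underlying signal."""
    s1 = [math.sin(0.3 * t) for t in range(start, start + length)]
    s2 = [2.0 * v + 1.0 for v in s1]   # scaled and offset copy of s1
    s3 = [-v for v in s1]              # inverted copy of s1
    return [s1, s2, s3]


library = train_library([make_window(0)])
normal = make_window(50)
stuck = make_window(50)
stuck[1] = [1.0] * 50                  # second sensor flat-lines
print(is_anomalous(normal, library), is_anomalous(stuck, library))
```

Because the check is on the relationships between signals and not on their raw values, a stuck sensor is flagged even though its reading stays inside its usual range, and a fault never seen in training is flagged simply because it matches no stored invariant.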

    Experimental analysis of computer system dependability

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
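
The abstract names importance sampling as a way to accelerate Monte Carlo simulation. As a rough illustration of why it helps when the event of interest is rare, the sketch below estimates the tail probability P(X > 4) for a standard normal X. The example problem, the shifted proposal N(4, 1), and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random


def naive_estimate(threshold, n, rng):
    """Plain Monte Carlo: fraction of N(0, 1) draws that exceed the threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_estimate(threshold, n, rng):
    """Draw from N(threshold, 1) and reweight each hit by the density ratio."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Ratio of N(0, 1) to N(threshold, 1) densities at x:
            # exp(-threshold * x + threshold**2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold * threshold)
    return total / n


def exact_tail(threshold):
    """Closed-form standard normal tail probability, for comparison."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


rng = random.Random(0)
print("exact     ", exact_tail(4.0))
print("naive     ", naive_estimate(4.0, 100_000, rng))
print("importance", importance_estimate(4.0, 100_000, rng))
```

With the same number of draws, the plain estimator sees only a handful of hits (often none), while the shifted proposal places about half of its samples in the rare region and corrects for the shift through the weights. Dependability simulations face the same problem, since failures are rare relative to normal operation.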

    Orion+: Automated Problem Diagnosis in Computing Systems by Mining Metric Data

    Nowadays, distributed systems are a necessity for almost all big enterprises. It is a programmer's nightmare to encounter a bug which causes failures in the system and leads to a crash on such a large infrastructure. With ever-increasing code sizes and processing needs, a tool is required that can assist a programmer in figuring out potential causes of a bug and minimizing the time taken for debugging, hence rectifying it quickly. We present our solution, Orion+, which compares the system metrics at various levels, namely the hardware, OS, middleware and application layers. It then makes use of the association information provided by the stack traces of the normal and abnormal runs to narrow down the specified buggy code region to a particular sequence of function calls that contain the bug or are most affected by the bug. We benchmarked our work against already established bugs in open-source software which have been fixed, and find that Orion+ is able to provide root cause analysis for all the benchmark bugs.

    Advanced Fault Diagnosis and Health Monitoring Techniques for Complex Engineering Systems

    Over the last few decades, the field of fault diagnostics and structural health management has been experiencing rapid developments. The reliability, availability, and safety of engineering systems can be significantly improved by implementing multifaceted strategies of in situ diagnostics and prognostics. With the development of intelligent algorithms, smart sensors, and advanced data collection and modeling techniques, this challenging research area has been receiving ever-increasing attention in both fundamental research and engineering applications. This has been strongly supported by extensive applications ranging from aerospace, automotive, transport, manufacturing, and processing industries to defense and infrastructure industries.