
    System log pre-processing to improve failure prediction

    Log preprocessing, a process applied to the raw log before applying a predictive method, is of paramount importance to failure prediction and diagnosis. While existing filtering methods have demonstrated good compression rates, they fail to preserve important failure patterns that are crucial for failure analysis. To address the problem, in this paper we present a log preprocessing method. It consists of three integrated steps: (1) event categorization, to uniformly classify system events and identify fatal events; (2) event filtering, to remove temporally and spatially redundant records while preserving the failure patterns necessary for failure analysis; (3) causality-related filtering, to combine correlated events for filtering through Apriori association rule mining. We demonstrate the effectiveness of our preprocessing method by using real failure logs collected from the Cray XT4 at ORNL and the Blue Gene/L system at SDSC. Experiments show that our method can preserve more failure patterns for failure analysis, thereby improving failure prediction by up to 174%.
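
    The temporal and spatial filtering steps described above can be illustrated with a minimal sketch. The record layout, function names, and time windows below are illustrative assumptions, not the paper's implementation; the event categorization and Apriori-based causality filtering steps are not shown.

```python
from collections import namedtuple

# Simplified log record: timestamp (seconds), reporting location, event category.
LogEvent = namedtuple("LogEvent", ["timestamp", "location", "category"])

def temporal_filter(events, window=300):
    """Drop records that repeat the same (location, category) within `window` seconds."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.location, ev.category)
        if key not in last_seen or ev.timestamp - last_seen[key] > window:
            kept.append(ev)
        last_seen[key] = ev.timestamp
    return kept

def spatial_filter(events, window=300):
    """Collapse the same category reported by many locations at nearly the same time."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.category not in last_seen or ev.timestamp - last_seen[ev.category] > window:
            kept.append(ev)
        last_seen[ev.category] = ev.timestamp
    return kept
```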

    Analysis of Gemini interconnect recovery mechanisms: methods and observations

    This thesis focuses on the resilience of network components and the recovery capabilities of extreme-scale high-performance computing (HPC) systems, specifically petaflop-level supercomputers aimed at solving complex science, engineering, and business problems that require high bandwidth, enhanced networking, and high compute capability. The resilience of the network is critical for ensuring successful execution of applications and overall system availability. Failures of interconnect components such as links, routers, and power supplies pose a threat to the resilience of the interconnect network, causing application failures and, in the worst case, system-wide failure. An extreme-scale system is designed to manage such failures and recover from them automatically, so as to ensure successful application execution and avoid system-wide failure. In this thesis, we therefore characterize the success probability of the recovery procedures as well as their impact on applications. We developed an interconnect recovery mechanisms analysis tool (I-RAT), a plugin built on top of LogDiver, to characterize and assess the impact of recovery mechanisms. The tool was used to analyze more than two years of network/system logs from Blue Waters, a supercomputer operated by the NCSA at the University of Illinois. Our analyses show that interconnect recovery mechanisms are frequently triggered (the mean time between triggers is as short as 36 hours for link failovers) and that the initiated recovery fails with relatively high probability (as much as 0.25 for link failover). Furthermore, system resilience does not equate to application resilience: executing applications can fail with non-negligible probability during (or just after) a successful recovery. We also show that as many as 20% of executing applications fail during the recovery phase.
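
    To make the kind of measurement reported here concrete, the sketch below computes the mean time between recovery triggers and the probability that a recovery fails from a list of timestamped recovery outcomes. The input format, function name, and toy values are assumptions for illustration; I-RAT itself operates on LogDiver output and is not reproduced here.

```python
from statistics import mean

def summarize_recoveries(recovery_events):
    """Summarize one class of recovery events, e.g. link failovers.

    recovery_events: list of (timestamp_hours, succeeded) tuples extracted from
    network/system logs (the format is hypothetical). Returns the mean time
    between triggers and the empirical probability that a recovery fails.
    """
    events = sorted(recovery_events, key=lambda e: e[0])
    gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
    mean_time_between_triggers = mean(gaps) if gaps else float("nan")
    failure_probability = sum(1 for _, ok in events if not ok) / len(events)
    return mean_time_between_triggers, failure_probability

# Toy input: four failovers roughly 36 hours apart, one of which failed,
# consistent in spirit with the figures reported above.
print(summarize_recoveries([(0, True), (36, True), (73, False), (108, True)]))
```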

    Log-based software monitoring: a systematic mapping study

    Modern software development and operations rely on monitoring to understand how systems behave in production. The data provided by application logs and the runtime environment are essential to detect and diagnose undesired behavior and improve system reliability. However, despite the rich ecosystem of industry-ready log solutions, monitoring complex systems and getting insights from log data remains a challenge. Researchers and practitioners have been actively working to address several challenges related to logs, e.g., how to provide better tooling support for developers' logging decisions, how to effectively process and store log data, and how to extract insights from log data. A holistic view of the research effort on logging practices and automated log analysis is key to providing directions and disseminating the state of the art for technology transfer. In this paper, we study 108 papers (72 research track papers, 24 journal papers, and 12 industry track papers) from different communities (e.g., machine learning, software engineering, and systems) and structure the research field in light of the life-cycle of log data. Our analysis shows that (1) logging is challenging not only in open-source projects but also in industry, (2) machine learning is a promising approach to enable contextual analysis of source code for log recommendation, but further investigation is required to assess the usability of those tools in practice, (3) few studies have approached efficient persistence of log data, and (4) there are open opportunities to analyze application logs and to evaluate state-of-the-art log analysis techniques in a DevOps context.

    Prediction-based failure management for supercomputers

    The growing requirements of a diversity of applications necessitate the deployment of large and powerful computing systems, and failures in these systems may cause severe damage, ranging from loss of human lives to economic losses. However, current fault tolerance techniques cannot meet the increasing requirements for reliability, so new solutions are urgently needed, and research on proactive schemes is one direction that may offer better efficiency. This thesis proposes a novel proactive failure management framework. Its goal is to reduce failure penalties and improve fault tolerance efficiency in supercomputers when running complex applications. The proposed proactive scheme builds on two core components: failure prediction and proactive failure recovery. More specifically, the failure prediction component is based on the assessment of system events and employs semi-Markov models to capture the dependencies between failures and other events for the forecasting of forthcoming failures. Furthermore, a two-level failure prediction strategy is described that not only estimates future failure occurrences but also identifies the specific failure categories. Based on this failure forecasting, a prediction-based coordinated checkpoint mechanism is designed to construct extra checkpoints just before each predicted failure occurrence, so that the wasted computational time can be significantly reduced. Moreover, a theoretical model has been developed to assess the proactive scheme, enabling calculation of the overall wasted computational time. The prediction component has been applied to industrial data from the IBM BlueGene/L system. Results show a great improvement in prediction accuracy compared with three other well-known prediction approaches, and demonstrate that the semi-Markov based predictor, which achieved a precision of 87.41% and a recall of 77.95%, performs better than the other predictors.
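
    The trade-off that a theoretical model of this kind captures can be sketched with a back-of-envelope calculation. This is not the model from the thesis; the formula, parameter names, and values below are illustrative assumptions: periodic checkpointing pays a fixed overhead, an unpredicted failure loses on average half a checkpoint interval of work, and prediction-triggered checkpoints trade extra checkpoint writes (including false alarms) for a smaller rollback loss.

```python
def expected_waste_per_hour(failure_rate, interval, ckpt_cost, recall=0.0, precision=1.0):
    """Rough expected wasted time per hour of computation (illustrative only).

    failure_rate : failures per hour
    interval     : periodic checkpoint interval (hours)
    ckpt_cost    : time to write one checkpoint (hours)
    recall       : fraction of failures correctly predicted
    precision    : fraction of predictions that correspond to real failures
    """
    periodic_overhead = ckpt_cost / interval
    # An unpredicted failure rolls back, on average, half a checkpoint interval.
    rollback_loss = failure_rate * (1.0 - recall) * (interval / 2.0)
    # Each predicted failure triggers an extra checkpoint; precision below 1
    # means additional checkpoints are also written on false alarms.
    prediction_overhead = (failure_rate * recall / max(precision, 1e-9)) * ckpt_cost
    return periodic_overhead + rollback_loss + prediction_overhead

# With the precision (87.41%) and recall (77.95%) reported above, the extra
# prediction-triggered checkpoints cost far less than the rollback they avoid.
baseline = expected_waste_per_hour(failure_rate=0.01, interval=1.0, ckpt_cost=0.05)
proactive = expected_waste_per_hour(failure_rate=0.01, interval=1.0, ckpt_cost=0.05,
                                    recall=0.7795, precision=0.8741)
print(baseline, proactive)
```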

    Understanding the Error Behavior of Complex Critical Software Systems through Field Data

    Software systems are the basis of everyday human activities, which are increasingly dependent on software. Software is an integral part of the systems we interact with in our daily life, ranging from small systems for entertainment and domotics to large systems and infrastructures that provide fundamental services such as telecommunication, transportation, and finance. In particular, software systems play a key role in critical domains, supporting crucial activities. For example, ground and air transportation, power supply, nuclear plants, and medical applications strongly rely on software systems: failures affecting these systems can lead to severe consequences, which can be catastrophic in terms of business or, even worse, human losses. Therefore, given the growing dependence on software systems in life- and safety-critical applications, dependability has become one of the most relevant industry and research concerns of the last decades. Software faults have been recognized as one of the major causes of system failures, since the hardware failure rate has been decreasing over the years. Time and cost constraints, along with technical limitations, often do not allow the correctness of the software to be fully validated by testing alone; therefore, software may be released with residual faults that activate during operation. The activation of a fault generates errors which propagate through the components of the system, possibly leading to a failure. Therefore, in order to produce reliable software, it is important to understand how errors affect a software system. This is of paramount importance especially in the context of complex critical software systems, where the occurrence of a failure can lead to severe consequences. However, the analysis of the error behavior of this kind of system is not trivial: such systems are often distributed systems based on many interacting heterogeneous components and layers, including Off-The-Shelf (OTS) and third-party components and legacy systems. All these aspects undermine the understanding of the error behavior of complex critical software systems. A well-established methodology to evaluate the dependability of operational systems and to identify their dependability bottlenecks is field failure data analysis (FFDA), which is based on the monitoring and recording of errors and failures that occur during the operational phase of the system under real workload conditions, i.e., field data. Indeed, direct measurement and analysis of natural failures occurring under real workload conditions is among the most accurate ways to assess dependability characteristics. One of the main sources of field data is monitoring techniques. The contribution of this thesis is a methodology for understanding the error behavior of complex critical software systems by means of field data generated by the monitoring techniques already implemented in the target system. The use of available monitoring techniques overcomes the limitations imposed in the context of critical systems, avoiding severe changes to the system and preserving its functionality and performance. The methodology is based on fault injection experiments that stimulate the target system with different error conditions. Injection experiments accelerate the collection of error data naturally generated by the monitoring techniques already implemented in the system.
The collected data are analyzed in order to characterize the behavior of the system under the software errors that occurred. To this aim, the proposed methodology leverages a set of innovative means defined in this dissertation: (i) Error Propagation graphs, which allow the analysis of the error propagation phenomena that occurred in the target system, as inferred from the collected field data, and a set of metrics composed of (ii) Error Determination Degree, which gives insight into the ability of a monitoring technique's error notifications to suggest either the fault that led to the error or the failure the error led to in the system, (iii) Error Propagation Reportability, which captures the ability of a monitoring technique to report the propagation of errors, and (iv) Data Dissimilarity, which gives insight into the suitability of the data generated by the monitoring techniques for failure analysis. The methodology has been experimented on two instances of complex critical software systems in the field of Air Traffic Control (ATC), i.e., a communication middleware supporting data exchange among ATC applications, and an arrival manager responsible for managing flight arrivals to a given airspace, within an industry-academia collaboration in the context of a national research project. Results show that field data generated by monitoring techniques already implemented in a complex critical software system can be leveraged to obtain insights about the error behavior exhibited by the target system, as well as about potentially beneficial locations for error detection mechanisms (EDMs) and error recovery mechanisms (ERMs). In addition, the proposed methodology allowed us to characterize the effectiveness of the monitoring techniques in terms of failure reporting, error propagation reportability, and data dissimilarity.
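
    As a rough illustration of how propagation can be inferred from timestamped error notifications (this is not the dissertation's actual Error Propagation graph construction; the record format, time window, and edge rule are assumptions), the sketch below links an error at one component to errors observed shortly afterwards at other components.

```python
from collections import defaultdict

def candidate_propagation_edges(notifications, window=5.0):
    """notifications: list of (timestamp_seconds, component, error_type) tuples
    collected from the monitoring techniques during one injection experiment.

    Adds a weighted edge (comp_a, err_a) -> (comp_b, err_b) whenever an error at
    component A is followed within `window` seconds by an error at a different
    component B; edge weights count how often the pattern was observed."""
    events = sorted(notifications, key=lambda n: n[0])
    edges = defaultdict(int)
    for i, (t_a, comp_a, err_a) in enumerate(events):
        for t_b, comp_b, err_b in events[i + 1:]:
            if t_b - t_a > window:
                break
            if comp_b != comp_a:
                edges[((comp_a, err_a), (comp_b, err_b))] += 1
    return dict(edges)
```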

    Towards efficient error detection in large-scale HPC systems

    The need for computer systems to be reliable has become increasingly important as users' dependence on their accurate functioning grows. The failure of these systems can be very costly in terms of time and money. However much system designers try to design fault-free systems, it is practically impossible to achieve, as many factors can affect them. In order to achieve system reliability, fault tolerance methods are usually deployed; these methods help the system to produce acceptable results even in the presence of faults. Root cause analysis, a dependability method in which the causes of failures are diagnosed for the purpose of correction or prevention of future occurrences, is less efficient: it is reactive and does not prevent the first failure from occurring. For this reason, methods with predictive capabilities are preferred; failure prediction methods are employed to predict potential failures so that preventive measures can be applied. Most predictive methods have been supervised, requiring accurate knowledge of the system's failures, errors, and faults. However, with changing system components and system updates, supervised methods become ineffective. Error detection methods allow error patterns to be detected early so that preventive methods can be applied. Performing this detection in an unsupervised way can be more effective, since changes or updates to the system affect such a solution less. In this thesis, we introduce an unsupervised approach to detecting error patterns in a system using its data. More specifically, the thesis investigates the use of both event logs and resource utilization data to detect error patterns, addressing both the spatial and temporal aspects of achieving system dependability. The proposed unsupervised error detection method has been applied to real data from two different production systems. The results are positive, showing an average detection F-measure of about 75%.
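
    A minimal sketch of unsupervised detection over windowed log and resource-utilization features is shown below, together with the F-measure used to evaluate it against labelled error windows. The robust z-score rule is only a stand-in for the thesis's method, and the feature layout and threshold are assumptions.

```python
import numpy as np

def flag_anomalous_windows(features, threshold=3.5):
    """features: one row per time window, one column per feature (e.g. event
    counts per log category, CPU/memory utilization). A window is flagged when
    any feature deviates from its median by more than `threshold` robust
    z-score units -- a simple unsupervised detector, not the thesis's method."""
    median = np.median(features, axis=0)
    mad = np.median(np.abs(features - median), axis=0) + 1e-9
    robust_z = np.abs(features - median) / mad
    return (robust_z > threshold).any(axis=1)

def f_measure(predicted, actual):
    """F-measure of flagged windows against labelled error windows (boolean arrays)."""
    tp = np.sum(predicted & actual)
    precision = tp / max(np.sum(predicted), 1)
    recall = tp / max(np.sum(actual), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```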