
    Seer: a lightweight online failure prediction approach

    Online failure prediction aims to predict the manifestation of failures at runtime, before the failures actually occur. Existing online failure prediction approaches typically operate on data that is either directly reported by the system under test or directly observable from outside system executions. These approaches generally refrain from collecting internal execution data that could further improve prediction quality, largely because of the runtime overhead incurred by the measurement instruments required to collect such data. In this work we conjecture that large reductions in the cost of collecting internal execution data for online failure prediction can derive from reducing the cost of the measurement instruments, while still supporting acceptable levels of prediction quality. To evaluate this conjecture, we present a lightweight online failure prediction approach, called Seer. Seer uses fast hardware performance counters to perform most of the data collection work. The data is augmented with further data collected by a minimal amount of software instrumentation added to the system's software. We refer to the data collected in this manner as hybrid spectra. We applied the proposed approach to three widely used open source subject applications and evaluated it by comparing and contrasting three types of hybrid spectra and two types of traditional software spectra. At the lowest level of runtime overhead attained in the experiments, the hybrid spectra predicted the failures about halfway through the executions with an F-measure of 0.77 and a runtime overhead of 1.98%, on average. Comparing hybrid spectra to software spectra, we observed that, for comparable runtime overhead levels, the hybrid spectra provided significantly better prediction accuracies and earlier warnings for failures than the software spectra. Alternatively, for comparable accuracy levels, the hybrid spectra incurred significantly lower runtime overheads and provided earlier warnings.
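    The idea of training a failure predictor on hybrid spectra can be illustrated with a short sketch. This is not the Seer implementation: the classifier choice, the synthetic counter readings, and the injected failure signature are all assumptions for illustration only.

```python
# Minimal sketch of failure prediction over "hybrid spectra": feature
# vectors combining hardware performance counter readings with a few
# software instrumentation counts. Synthetic data stands in for real
# executions; this is not the Seer implementation itself.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_runs, n_hw, n_sw = 1000, 8, 2           # executions, HW counters, SW probes
X = rng.normal(size=(n_runs, n_hw + n_sw))
y = rng.integers(0, 2, size=n_runs)       # 1 = run will eventually fail
# Failing runs drift on a couple of counters (e.g., cache misses,
# branch mispredictions); this is an invented failure signature.
X[y == 1, :2] += 1.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("F-measure:", f1_score(y_te, clf.predict(X_te)))
```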

    Software failure prediction based on patterns of multiple-event failures

    A fundamental need for software reliability engineering is to comprehend how software systems fail, which means understanding the dynamics that govern different types of failure manifestation. In this research, I present an exploratory study on multiple-event failures, a failure manifestation characterized by sequences of failure events varying in length, duration, and combination of failure types. This study aims to (i) improve the understanding of multiple-event failures in real software systems, investigating their occurrences, associations, and causes; (ii) propose analysis protocols that take into account multiple-event failure manifestations; and (iii) take advantage of the sequential nature of this type of software failure to perform predictions. The failures analyzed in this research were observed empirically. In total, I analyzed 42,209 real software failures from 644 computers used in different workplaces. The major contributions of this study are a protocol developed to investigate the existence of patterns of failure associations; a protocol to discover patterns of failure sequences; and a prediction approach whose main concept is to calculate the probability of a certain failure event occurring within a time interval upon the occurrence of a particular pattern of preceding failures. I used three methods to tackle the prediction problem: Multinomial Logistic Regression (with and without Ridge regularization), Decision Tree, and Random Forest. These methods were chosen due to the nature of the failure data, in which the failure types must be handled as categorical variables. Initially, I performed a failure association discovery analysis which included only failures from a widely used commercial off-the-shelf Operating System (OS). As a result, I discovered 45 OS failure association patterns with 153,511 occurrences, composed of the same or different failure types and occurring systematically within well-established time intervals. The observed associations suggest the existence of underlying mechanisms governing these failure occurrences, which motivated improving the previous method by creating a protocol to discover patterns of failure sequences using flexible time thresholds, together with a failure prediction approach. To obtain a comprehensive view of how different software failures may affect each other, both methods were applied to three different samples: the first contained only OS failures, the second contained only User Application failures, and the third encompassed both OS and User Application failures. As a result, I found 165, 480, and 640 different failure sequences with thousands of occurrences, respectively. Finally, the proposed approach was able to predict failures with good to high accuracy (86% to 93%).
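    The core prediction idea, estimating how likely a given failure event is given a pattern of preceding failures, can be sketched with one of the methods the study names, multinomial logistic regression. The pattern length, encoding, and synthetic event stream below are illustrative assumptions, and the time-interval component is omitted for brevity.

```python
# Sketch: predict the next failure type from the k preceding failure
# events using multinomial logistic regression. Failure types are
# categorical; the synthetic event stream stands in for the real logs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(1)
types = np.array(["crash", "hang", "io_error"])   # hypothetical failure types
stream = rng.choice(len(types), size=5000)        # synthetic failure log
k = 3                                             # preceding-pattern length

# Build (pattern of k preceding failures) -> (next failure) examples.
X_raw = np.stack([stream[i:i + k] for i in range(len(stream) - k)])
y = stream[k:]

enc = OneHotEncoder(sparse_output=False).fit(X_raw)
X = enc.transform(X_raw)

model = LogisticRegression(max_iter=1000).fit(X, y)
pattern = enc.transform([[0, 0, 2]])              # crash, crash, io_error
print(dict(zip(types[model.classes_], model.predict_proba(pattern)[0])))
```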

    Analyzing and Predicting Effort Associated with Finding and Fixing Software Faults

    Context: Software developers spend a significant amount of time fixing faults. However, not many papers have addressed the actual effort needed to fix software faults. Objective: The objective of this paper is twofold: (1) analysis of the effort needed to fix software faults and how it was affected by several factors and (2) prediction of the level of fix implementation effort based on the information provided in software change requests. Method: The work is based on data related to 1,200 failures, extracted from the change tracking system of a large NASA mission. The analysis includes descriptive and inferential statistics. Predictions are made using three supervised machine learning algorithms and three sampling techniques aimed at addressing the imbalanced data problem. Results: Our results show that (1) 83% of the total fix implementation effort was associated with only 20% of failures. (2) Both safety-critical failures and post-release failures required three times more effort to fix than their non-critical and pre-release counterparts, respectively. (3) Failures with fixes spread across multiple components or across multiple types of software artifacts required more effort; the spread across artifacts was more costly than the spread across components. (4) Surprisingly, some types of faults associated with later life-cycle activities did not require significant effort. (5) The level of fix implementation effort was predicted with 73% overall accuracy using the original, imbalanced data. Using oversampling techniques improved the overall accuracy up to 77%. More importantly, oversampling significantly improved the prediction of high-level effort, from 31% to around 85%. Conclusions: This paper shows the importance of tying software failures to the changes made to fix all associated faults, in one or more software components and/or in one or more software artifacts, and the benefit of studying how the spread of faults and other factors affect the fix implementation effort.
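    A minimal sketch of the prediction setup follows. The paper names three sampling techniques without tying them to a specific library, so SMOTE from the imbalanced-learn package is used here as one common oversampling choice; the features and data are synthetic placeholders.

```python
# Sketch: predicting fix-effort level (low/medium/high) from change
# request features, with SMOTE oversampling to counter class imbalance.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1200, 6))                      # e.g., spread across components,
y = rng.choice([0, 1, 2], 1200, p=[0.6, 0.3, 0.1])  # artifacts, severity, criticality
X[y == 1, 0] += 0.8                                 # invented effort signal
X[y == 2, :2] += 1.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # balance classes

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```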

    Analysis of Software Aging in a Web Server

    A number of recent studies have reported the phenomenon of "software aging", characterized by progressive performance degradation and/or an increased occurrence rate of hang/crash failures of a software system due to the exhaustion of operating system resources or the accumulation of errors. To counteract this phenomenon, a proactive technique called "software rejuvenation" has been proposed. It essentially involves stopping the running software, cleaning its internal state and/or its environment, and then restarting it. Software rejuvenation, being preventive in nature, begs the question of when to schedule it. Periodic rejuvenation, while straightforward to implement, may not yield the best results, because the rate at which software ages is not constant but depends on the time-varying system workload. Software rejuvenation should therefore be planned and initiated based on the actual system behavior. This requires the measurement, analysis, and prediction of system resource usage. In this paper, we study the evolution of resource usage in a web server while subjecting it to an artificial workload. We first collect data on several system resource usage and activity parameters. Non-parametric statistical methods are then applied for detecting and estimating trends in the data sets. Finally, we fit time series models to the data collected. Unlike the models used previously in the research on software aging, these time series models allow for seasonal patterns, and we show how exploiting the seasonal variation can help in adequately predicting future resource usage. Based on the models employed here, proactive management techniques like software rejuvenation triggered by actual measurements can be built. Keywords: software aging, software rejuvenation, Linux, Apache, web server, performance monitoring, prediction of resource utilization, non-parametric trend analysis, time series analysis.
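    A seasonal time-series forecast of resource usage might look like the following sketch. Holt-Winters exponential smoothing is used here as one seasonal model; the paper's exact model family, the hourly sampling, and the synthetic series are assumptions.

```python
# Sketch: fit a seasonal time-series model to a resource-usage series
# (e.g., free memory sampled hourly) and forecast ahead. The series
# combines a slow aging trend with a daily seasonal pattern.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(3)
hours = np.arange(24 * 30)                      # 30 days, hourly samples
usage = (100 - 0.02 * hours                     # slow aging trend
         + 5 * np.sin(2 * np.pi * hours / 24)   # daily (seasonal) pattern
         + rng.normal(0, 1, hours.size))

fit = ExponentialSmoothing(usage, trend="add", seasonal="add",
                           seasonal_periods=24).fit()
forecast = fit.forecast(48)                     # predict the next two days
print(forecast[:5])
```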

    Identifying Common Patterns and Unusual Dependencies in Faults, Failures and Fixes for Large-scale Safety-critical Software

    As software evolves, becoming a more integral part of complex systems, modern society becomes more reliant on the proper functioning of such systems. However, the field of software quality assurance lacks detailed empirical studies from which best practices can be determined. The fundamental factors that contribute to software quality are faults, failures, and fixes, and although some studies have considered specific aspects of each, comprehensive studies have been quite rare. Thus, establishing the cause-effect relationship between the fault(s) that caused individual failures, as well as the link to the fixes made to prevent those failures from (re)occurring, appears to be a unique characteristic of our work. In particular, we analyze fault types, verification activities, severity levels, investigation effort, artifacts fixed, components fixed, and the effort required to implement fixes for a large industrial case study. The analysis includes descriptive statistics, statistical inference through formal hypothesis testing, and data mining. Some of the most interesting empirical results include: (1) Contrary to popular belief, later life-cycle faults dominate as causes of failures. Furthermore, over 50% of high-priority failures (e.g., post-release failures and safety-critical failures) were caused by coding faults. (2) 15% of failures led to fixes spread across multiple components, and the spread was largely affected by the software architecture. (3) The amount of effort spent fixing faults associated with each failure was not uniformly distributed across failures; fixes with a greater spread across components and artifacts required more effort. Overall, the work indicates that fault prevention and elimination efforts focused on later life-cycle faults are essential, as coding faults were the dominant cause of safety-critical failures and post-release failures. Further, statistical correlation and/or traditional data mining techniques show potential for assessment and prediction of the locations of fixes and the associated effort. By providing quantitative results and including statistical hypothesis testing, which is not yet a standard practice in software engineering, our work enriches the empirical knowledge needed to improve the state-of-the-art and practice in software quality assurance.
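    As one example of the kind of formal hypothesis test such a study applies, the sketch below tests whether fault type and failure priority are associated, using a chi-square test on an invented contingency table (the counts are not the study's data).

```python
# Sketch: chi-square test of association between fault type and
# failure priority. The contingency table is illustrative only.
from scipy.stats import chi2_contingency

#                 high-priority  low-priority
table = [[120, 80],     # coding faults
         [40, 160]]     # requirements/design faults
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")  # small p => the two factors are associated
```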

    Lightweight runtime failure prediction

    Software systems are getting increasingly complex and bigger in size. When these general trends are coupled with the shortcomings of software quality assurance techniques and time-to-market pressures, development houses are forced to release their software with many known and unknown defects, which inevitably cause failures in the field. Many approaches have been proposed in the literature to predict the manifestation of software failures at runtime and proactively take preventive measures, such as preventing the failures or decreasing their harmful consequences. Runtime prediction of failures is an integral part of such proactive-preventive frameworks. One downside of the existing approaches is that they treat software systems as black boxes and leverage only profiling data that is directly observable from outside the programs, such as CPU, memory, and network utilization. Internal execution data is typically not leveraged, solely due to the potential runtime overhead that collecting internal execution data can impose while the programs are running. As failure prediction approaches target software systems operating in the field, high overhead costs are generally not acceptable. Consequently, the existing approaches mainly target predicting failures caused by software aging. In this thesis, we present a lightweight runtime failure prediction approach that leverages internal execution data. We furthermore evaluate the approach by conducting a series of large-scale experiments, in which three widely used software applications were used as subject applications. The results of our experiments strongly suggest that the proposed approach can reliably predict software failures at an affordable cost.
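    The online half of such an approach can be sketched as a sliding-window loop: a classifier trained offline scores summaries of recent execution-data samples and raises a warning when the predicted failure probability crosses a threshold. The classifier, window summary, and threshold below are assumptions, not the thesis's implementation.

```python
# Sketch: online runtime failure prediction. A pre-trained classifier
# scores a sliding window over streaming execution data; placeholder
# random data stands in for the real internal execution measurements.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X_hist = rng.normal(size=(500, 8))
y_hist = rng.integers(0, 2, 500)
clf = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)  # offline training

WINDOW, THRESHOLD = 10, 0.8
buffer = []
for sample in rng.normal(size=(100, 8)):     # streaming execution data
    buffer.append(sample)
    if len(buffer) >= WINDOW:
        features = np.mean(buffer[-WINDOW:], axis=0)   # summarize the window
        p_fail = clf.predict_proba([features])[0, 1]
        if p_fail > THRESHOLD:
            print(f"warning: predicted failure (p={p_fail:.2f})")
```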

    Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for Handling High Dimensional Data and Class Imbalance Problem

    Software defect prediction is an important part of software engineering: the process of predicting defects, such as errors and failures, in software systems. Researchers use machine learning methods to predict software defects, including estimation, association, classification, clustering, and dataset analysis. The NASA Metrics Data Program (NASA MDP) datasets are among the software metrics data that researchers use to predict software defects. The NASA MDP datasets contain imbalanced classes and high-dimensional data, which lowers classification evaluation results. In this research, class imbalance is addressed with the AdaCost method and high-dimensional data is handled with the Average Weight Information Gain (AWEIG) method, while the classification method used is the Naïve Bayes algorithm. The proposed method is named AWEIG + AdaCost Bayesian. In this experiment, the AWEIG + AdaCost Bayesian algorithm is compared to the Naïve Bayes algorithm. The results show that the AWEIG + AdaCost Bayesian algorithm yields a better mean Area Under the Curve (AUC) than the Naïve Bayes algorithm alone, with mean AUC values of 0.752 and 0.696, respectively.
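    A rough sketch of the two ingredients follows, under stated simplifications: mutual information stands in for the information-gain weights that AWEIG averages, and cost-proportional sample weights stand in for AdaCost's cost-sensitive boosting. The data are synthetic; this is not the paper's implementation.

```python
# Sketch: information-gain-style feature weighting plus cost-sensitive
# Naive Bayes on imbalanced, high-dimensional defect data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 20))              # high-dimensional metrics
y = (rng.random(1000) < 0.15).astype(int)    # imbalanced: ~15% defective
X[y == 1, :3] += 1.0                         # invented defect signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
w = mutual_info_classif(X_tr, y_tr, random_state=0)   # per-feature relevance
X_tr_w, X_te_w = X_tr * w, X_te * w                   # emphasize informative features

# Misclassification costs proportional to class imbalance (AdaCost proxy).
cost = np.where(y_tr == 1, (y_tr == 0).sum() / (y_tr == 1).sum(), 1.0)
clf = GaussianNB().fit(X_tr_w, y_tr, sample_weight=cost)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te_w)[:, 1]))
```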

    Towards A Software Failure Cost Impact Model for the Customer: An Analysis of an Open Source Product

    While the financial consequences of software errors on the developer's side have been explored extensively, the costs arising for the end user have been largely neglected. One reason is the difficulty of linking errors in the code with the emerging failure behavior of the software. The problem becomes even more difficult when trying to predict failure probabilities based on models or code metrics. In this paper we take a first step towards a cost prediction model by exploring the possibilities of modeling the financial consequences of already identified software failures. Firefox, a well-known open source software product, is used as the test subject. Historically identified failures are modeled using fault trees. To identify costs, usage profiles are employed to depict the interaction with the system. The presented approach demonstrates that it is possible to model failure costs for an organization using a specific software product by establishing a relationship between user behavior, software failures, and costs. As future work, we aim to extend the model with software error prediction techniques and to validate it empirically.
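    The chain from fault tree through usage profile to cost can be made concrete with a small sketch. The gate structure, event probabilities, usage frequency, and cost figure below are all invented for illustration; they are not the Firefox model.

```python
# Sketch: expected failure cost from a fault tree plus a usage profile.
def or_gate(*probs):
    """Top event occurs if any input event occurs (independent events)."""
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(*probs):
    """Top event occurs only if all input events occur."""
    p_all = 1.0
    for p in probs:
        p_all *= p
    return p_all

# Basic-event probabilities per use of a feature (e.g., page rendering).
p_failure = or_gate(0.001, and_gate(0.02, 0.05))  # fault tree top event

usage_per_day = 200        # usage profile: feature invocations per day
cost_per_failure = 3.0     # e.g., monetized minutes of lost work
expected_daily_cost = usage_per_day * p_failure * cost_per_failure
print(f"P(failure per use)={p_failure:.4f}, "
      f"expected daily cost={expected_daily_cost:.2f}")
```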

    Modeling and analysis of high availability techniques in a virtualized system

    Availability evaluation of a virtualized system is critical to the wide deployment of cloud computing services. Time-based and prediction-based rejuvenation of virtual machines (VMs) and virtual machine monitors, VM failover, and live VM migration are common high-availability (HA) techniques in a virtualized system. This paper investigates the effect of combining these availability techniques on VM availability in a virtualized system where various software and hardware failures may occur. For each combination, we construct analytic models. The results show that: (1) the HA techniques can be combined with rejuvenation mechanisms to improve VM availability; (2) prediction-based rejuvenation enhances VM availability much more than time-based VM rejuvenation when the prediction success probability is above 70%, regardless of whether failover and/or live VM migration is also deployed; (3) the failover mechanism outperforms live VM migration, although the two can work together for higher VM availability and can further be combined with software rejuvenation mechanisms for even higher availability; and (4) the time interval setting is critical to a time-based rejuvenation mechanism. These analytic results provide guidelines for deploying HA techniques in a virtualized system and for setting their parameters.
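    A minimal sketch of the style of analytic model involved: a four-state continuous-time Markov chain for a single VM with aging, failure, repair, and time-based rejuvenation, solved for steady-state availability. The states and rates are illustrative assumptions; the paper's models combine several HA mechanisms.

```python
# Sketch: steady-state VM availability from a tiny continuous-time
# Markov chain with states (up, aging, failed, rejuvenating).
import numpy as np

lam_age, lam_fail = 1 / 240.0, 1 / 48.0   # hours^-1: aging onset, failure
mu_rej, mu_rep = 1 / 0.1, 1 / 2.0         # rejuvenation and repair rates
rej_trigger = 1 / 24.0                    # time-based rejuvenation rate

# Generator matrix Q: Q[i, j] is the transition rate i -> j; rows sum to 0.
Q = np.array([
    # up         aging                          failed    rejuvenating
    [-lam_age,   lam_age,                       0.0,      0.0],
    [0.0,        -(lam_fail + rej_trigger),     lam_fail, rej_trigger],
    [mu_rep,     0.0,                           -mu_rep,  0.0],
    [mu_rej,     0.0,                           0.0,      -mu_rej],
])

# Solve pi @ Q = 0 with sum(pi) = 1 (replace one balance equation).
A = np.vstack([Q.T[:-1], np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)

# Assume the aging (degraded but operational) state still serves requests.
availability = pi[0] + pi[1]
print(f"steady-state availability: {availability:.5f}")
```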