    Classification in sparse, high dimensional environments applied to distributed systems failure prediction

    Network failures are still one of the main causes of distributed systems’ lack of reliability. To address this problem, we present an improvement to a failure prediction system based on Elastic Net Logistic Regression and rare-event prediction techniques, which is able to work with sparse, high-dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter, and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when taking a proactive approach to failure management.

    Understanding error log event sequence for failure analysis

    As large-scale parallel systems have evolved, they have come to be employed mostly for mission-critical applications. Anticipating and accommodating failures is therefore crucial to their design. Failure is commonplace in these large-scale systems and cannot be treated as an exception. The system state is mostly captured through the logs, which contain the “health” information of the system, so a proper understanding of these error logs is extremely important for failure analysis. In this paper we design an approach that seeks to find similarities in the patterns of log events that lead to failures. Our experiment shows that several root causes of soft lockup failures could be traced through the logs. We capture the behavior of failure-inducing patterns and find that the log patterns of failure and non-failure sequences are dissimilar. Keywords: Failure Sequences; Cluster; Error Logs; HPC; Similarity

    On-line failure prediction in safety-critical systems

    In safety-critical systems such as Air Traffic Control systems, SCADA systems, and Railway Control Systems, there has been a rapid transition from monolithic systems to highly modular ones that use off-the-shelf hardware and software applications, possibly developed by different manufacturers. This shift has increased the probability that a fault occurring in one application propagates to others, with the risk of a failure of the entire safety-critical system. This calls for new tools for the on-line detection of anomalous system behavior, so that a system failure can be predicted before it happens and appropriate mitigation policies can be deployed. The paper proposes a novel architecture for online failure prediction, named CASPER, with two distinctive features: it is (i) black-box: no knowledge of application internals or of the logic of the system is required; and (ii) non-intrusive: no status information of the components, such as CPU or memory usage, is used. The architecture has been implemented to predict failures in a real Air Traffic Control System. CASPER exhibits a high degree of accuracy in predicting failures with a low false positive rate. The experimental validation shows that operators are provided with predictions issued a few hundred seconds before the occurrence of the failure.

    Flood hazard hydrology: interdisciplinary geospatial preparedness and policy

    Thesis (Ph.D.) University of Alaska Fairbanks, 2017. Floods rank as the deadliest and most frequently occurring natural hazard worldwide, and in 2013 floods in the United States ranked second only to wind storms in accounting for loss of life and damage to property. While flood disasters remain difficult to accurately predict, more precise forecasts and better understanding of the frequency, magnitude and timing of floods can help reduce the loss of life and costs associated with the impact of flood events. There is a common perception that 1) local-to-national-level decision makers do not have the accurate, reliable and actionable data and knowledge they need in order to make informed flood-related decisions, and 2) because of science-policy disconnects, critical flood and scientific analyses and insights are failing to influence policymakers in national water resource and flood-related decisions that have significant local impact. This dissertation explores these perceived information gaps and disconnects, and seeks to answer the question of whether flood data can be accurately generated, transformed into useful actionable knowledge for local flood event decision makers, and then effectively communicated to influence policy. Utilizing an interdisciplinary mixed-methods research design approach, this thesis develops a methodological framework and interpretative lens for each of three distinct stages of flood-related information interaction: 1) data generation: using machine learning to estimate streamflow flood data for forecasting and response; 2) knowledge development and sharing: creating a geoanalytic visualization decision support system for flood events; and 3) knowledge actualization: using heuristic toolsets for translating scientific knowledge into policy action.
Each stage is elaborated on in three distinct research papers, incorporated as chapters in this dissertation, that focus on developing practical data and methodologies that are useful to scientists, local flood event decision makers, and policymakers. Data and analytical results of this research indicate that, if certain conditions are met, it is possible to provide local decision makers and policy makers with the useful actionable knowledge they need to make timely and informed decisions.

    Online Event Correlations Analysis in System Logs of Large-Scale Cluster Systems

    Checkpoint-based Fault-tolerant Infrastructure for Virtualized Service Providers

    Crash and omission failures are common in service providers: a disk can break down or a link can fail at any time. In addition, the probability of a node failure increases with the number of nodes. Apart from reducing the provider’s computation power and jeopardizing the fulfillment of its contracts, this can also waste computation time when the crash occurs before the task finishes executing. To avoid this problem, efficient checkpoint infrastructures are required, especially in virtualized environments, where these infrastructures must deal with huge virtual machine images. This paper proposes a smart checkpoint infrastructure for virtualized service providers. It uses Another Union File System to differentiate read-only from read-write parts in the virtual machine image. In this way, read-only parts need to be checkpointed only once, while subsequent checkpoints save only the modifications in read-write parts, thus reducing the time needed to make a checkpoint. The checkpoints are stored in a Hadoop Distributed File System. This allows a task execution to be resumed faster after a node crash and increases the fault tolerance of the system, since checkpoints are distributed and replicated across all the nodes of the provider. This paper presents a running implementation of this infrastructure and its evaluation, demonstrating that it is an effective way to make faster checkpoints with low interference on task execution and efficient task recovery after a node failure. Peer Reviewed. Postprint (published version).

    Coordinating the Competition, Pre-electoral Coalitions in the Indian General Elections

    The number and variety of pre-electoral coalitions in the Indian general elections make India a prime case to examine why parties choose to join forces with their rivals during elections. Yet existing theories, which emphasise narrow definitions of party size and shared ideology, are unable to explain the tangled alliances that emerge between Indian political parties. In order to examine why parties pursue certain pre-electoral coalitions, I employ a mixed-methods strategy that combines statistical network analysis (exponential random graph models) with case study analysis, using a new dataset of pre-electoral coalitions covering 1999-2014. The network analysis suggests that pre-electoral coalitions in India are driven by the parties’ wish to increase their odds of winning in particular constituencies and, to a smaller degree, their wish to combine their parliamentary strength afterwards. The analysis also suggests that the network structure of the party system has a significant impact on pre-electoral coalition formation, in that parties are attracted to ‘high-connector parties’ that allow them to form indirect alliances with a number of parties, and that parties build denser, regional coalitions that allow smaller parties to buy leverage against bigger allies. Finally, even though pre-electoral coalitions in India appear highly changeable, parties are more likely to renew an existing pre-electoral coalition than to build a new one. I explore the implications of the network analysis in three case studies, namely a pre-electoral coalition that took place as the model predicted (a true positive case), one that did not take place despite being predicted (a false positive case), and one that took place despite not being predicted (a false negative case). The case studies corroborate the statistical findings but also demonstrate that network structures can both encourage and hinder pre-electoral coalitions.