15,450 research outputs found

    Detecting Flow Anomalies in Distributed Systems

    Get PDF
    Deep within the networks of distributed systems, one often finds anomalies that affect their efficiency and performance. These anomalies are difficult to detect because the distributed systems may not have sufficient sensors to monitor the flow of traffic within the interconnected nodes of the networks. Without early detection and making corrections, these anomalies may aggravate over time and could possibly cause disastrous outcomes in the system in the unforeseeable future. Using only coarse-grained information from the two end points of network flows, we propose a network transmission model and a localization algorithm, to detect the location of anomalies and rank them using a proposed metric within distributed systems. We evaluate our approach on passengers' records of an urbanized city's public transportation system and correlate our findings with passengers' postings on social media microblogs. Our experiments show that the metric derived using our localization algorithm gives a better ranking of anomalies as compared to standard deviation measures from statistical models. Our case studies also demonstrate that transportation events reported in social media microblogs matches the locations of our detect anomalies, suggesting that our algorithm performs well in locating the anomalies within distributed systems

    DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters

    Full text link
    When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest, publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss, or even worse, reliability degradation of a datacenter. We further propose a two-stage framework-DC-Prophet-based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure, and a F3-score of 0.88 (out of 1). On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F3-score.Comment: 13 pages, 5 figures, accepted by 2017 ECML PKD

    Reliability analysis and fault-tolerant system development for a redundant strapdown inertial measurement unit

    Get PDF
    A methodology is developed and applied for quantitatively analyzing the reliability of a dual, fail-operational redundant strapdown inertial measurement unit (RSDIMU). A Markov evaluation model is defined in terms of the operational states of the RSDIMU to predict system reliability. A 27 state model is defined based upon a candidate redundancy management system which can detect and isolate a spectrum of failure magnitudes. The results of parametric studies are presented which show the effect on reliability of the gyro failure rate, both the gyro and accelerometer failure rates together, false alarms, probability of failure detection, probability of failure isolation, and probability of damage effects and mission time. A technique is developed and evaluated for generating dynamic thresholds for detecting and isolating failures of the dual, separated IMU. Special emphasis is given to the detection of multiple, nonconcurrent failures. Digital simulation time histories are presented which show the thresholds obtained and their effectiveness in detecting and isolating sensor failures

    BSML: A Binding Schema Markup Language for Data Interchange in Problem Solving Environments (PSEs)

    Full text link
    We describe a binding schema markup language (BSML) for describing data interchange between scientific codes. Such a facility is an important constituent of scientific problem solving environments (PSEs). BSML is designed to integrate with a PSE or application composition system that views model specification and execution as a problem of managing semistructured data. The data interchange problem is addressed by three techniques for processing semistructured data: validation, binding, and conversion. We present BSML and describe its application to a PSE for wireless communications system design
    corecore