5,330 research outputs found

    Path analysis for process troubleshooting

    Strider: a black-box, state-based approach to change and configuration management and support

    Abstract: We describe a new approach, called Strider, to Change and Configuration Management and Support (CCMS). Strider is a black-box approach: without relying on specifications, it uses state differencing to identify potential causes of differing program behaviors, uses state tracing to identify actual, run-time state dependencies, and uses statistical behavior modeling for noise filtering. Strider is a state-based approach: instead of linking vague, high-level descriptions and symptoms to relevant actions, it models management and support problems in terms of individual, named pieces of low-level configuration state and provides precise mappings to user-friendly information through a computer genomics database. We use troubleshooting of configuration failures to demonstrate that the Strider approach reduces problem complexity by several orders of magnitude, making root-cause analysis possible.
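
The three steps named in this abstract (state differencing, state tracing, noise filtering) can be pictured with a minimal sketch. This is not Strider's implementation: the function names, the dictionary-based snapshots, the `change_frequency` table, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; every name and value below is an assumption,
# not taken from the Strider paper.

_MISSING = object()  # sentinel for a key absent from one snapshot


def diff_states(good, bad):
    """State differencing: keys whose values differ between two snapshots."""
    keys = good.keys() | bad.keys()
    return {k for k in keys if good.get(k, _MISSING) != bad.get(k, _MISSING)}


def candidate_causes(good, bad, traced_keys, change_frequency, noise_threshold=0.5):
    """Narrow the diff to keys the failing program actually touched,
    then drop keys that change so often they are treated as noise."""
    changed = diff_states(good, bad)
    relevant = changed & set(traced_keys)  # state tracing
    return sorted(
        k for k in relevant
        if change_frequency.get(k, 0.0) < noise_threshold  # noise filtering
    )


# Hypothetical configuration snapshots from a working and a failing machine.
good = {"proxy.enabled": 0, "cache.size": 64, "last_run": "t1", "theme": "light"}
bad = {"proxy.enabled": 1, "cache.size": 64, "last_run": "t2", "theme": "dark"}

# Keys the failing program was observed to read at run time (assumed trace).
traced = {"proxy.enabled", "last_run", "cache.size"}

# Assumed fraction of observation windows in which each key changed.
freq = {"last_run": 0.9}

print(candidate_causes(good, bad, traced, freq))
```

In this toy data the diff contains three keys, the trace removes `theme` because the program never read it, and the noise filter removes `last_run` because it changes almost constantly, leaving a single candidate to inspect.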

    Monitoring and analysis system for performance troubleshooting in data centers

    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing Service (ELB for short), which went unnoticed at the time. The mistake first led to a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. One example was Netflix, which was using hundreds of Amazon ELB services and experienced an extensive streaming service outage, leaving many customers unable to watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world’s attention. As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen.
    We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of metrics, from the application level to the system/platform level, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average. Ph.D.

    Investigating the role of model-based reasoning while troubleshooting an electric circuit

    We explore the overlap of two nationally-recognized learning outcomes for physics lab courses, namely, the ability to model experimental systems and the ability to troubleshoot a malfunctioning apparatus. Modeling and troubleshooting are both nonlinear, recursive processes that involve using models to inform revisions to an apparatus. To probe the overlap of modeling and troubleshooting, we collected audiovisual data from think-aloud activities in which eight pairs of students from two institutions attempted to diagnose and repair a malfunctioning electrical circuit. We characterize the cognitive tasks and model-based reasoning that students employed during this activity. In doing so, we demonstrate that troubleshooting engages students in the core scientific practice of modeling. Comment: 20 pages, 6 figures, 4 tables; Submitted to Physical Review PE

    Knowledge-based diagnosis for aerospace systems

    The need for automated diagnosis in aerospace systems and the approach of using knowledge-based systems are examined. Research issues in knowledge-based diagnosis that are important for aerospace applications are treated, along with a review of recent relevant research developments in Artificial Intelligence. The design and operation of some existing knowledge-based diagnosis systems are described. The systems described and compared include the LES expert system for liquid oxygen loading at NASA Kennedy Space Center, the FAITH diagnosis system developed at the Jet Propulsion Laboratory, the PES procedural expert system developed at SRI International, the CSRL approach developed at Ohio State University, the StarPlan system developed by Ford Aerospace, the IDM integrated diagnostic model, and the DRAPhys diagnostic system developed at NASA Langley Research Center.

    Track: Tracerouting in SDN networks with arbitrary network functions

    The centralization of the control plane in software-defined networking (SDN) creates a paramount challenge for troubleshooting the network, as packets are ultimately forwarded by distributed data planes. Existing path tracing tools largely utilize packet tags to probe network paths among SDN-enabled switches. However, network functions (NFs) or middleboxes, whose presence is ubiquitous in today's networks, can drop packets or alter their tags, an action that can collapse the probing mechanism. In addition, sending probing packets through network functions could corrupt their internal states, risking the correctness of the servicing logic (e.g., incorrect load balancing decisions). In this paper, we present a novel troubleshooting tool, Track, for SDN-enabled networks with arbitrary NFs. Track can discover the forwarding path, including NFs, taken by any packet, without changing the forwarding rules in switches or the internal states of NFs. We have implemented Track on the RYU controller. Our extensive experimental results show that Track can achieve 95.08% and 100% accuracy for discovering forwarding paths with and without NFs respectively, and can efficiently generate traces within 3 milliseconds per hop.