Automata for Web Services Fault Monitoring and Diagnosis
Like any software, web services must go through the phases of the fault management lifecycle. Model-based diagnosis is a well-established practice with several advantages, including being cognitively easier for development and testing teams to understand. An automaton is a simple, formally well-defined model used for monitoring and diagnosing system faults. For this reason, we review work on automata for web service fault management and also propose a stochastic automaton model for this purpose.
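To make the idea concrete, here is a minimal sketch of an automaton-based monitor for a web service's event stream; the states, events, and transition probabilities are invented for illustration and are not the model proposed in the paper.

```python
# Hypothetical sketch: a stochastic automaton that monitors a web service's
# event stream and flags low-probability transitions as fault symptoms.

# Transition model: (state, event) -> (next_state, probability).
# States, events, and probabilities here are invented for illustration.
TRANSITIONS = {
    ("idle", "request"):        ("processing", 0.97),
    ("processing", "response"): ("idle", 0.95),
    ("processing", "timeout"):  ("faulty", 0.03),
    ("faulty", "restart"):      ("idle", 0.90),
}

def monitor(events, threshold=0.05):
    """Replay observed events; report transitions whose modeled
    probability falls below `threshold` (likely fault symptoms)."""
    state = "idle"
    for event in events:
        next_state, prob = TRANSITIONS.get((state, event), ("unknown", 0.0))
        if prob < threshold:
            yield (state, event, next_state, prob)  # suspicious transition
        state = next_state

# Example: a timeout during processing is reported as a symptom.
for symptom in monitor(["request", "timeout", "restart"]):
    print("suspicious:", symptom)
```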
Why (and How) Networks Should Run Themselves
The proliferation of networked devices, systems, and applications that we
depend on every day makes managing networks more important than ever. The
increasing security, availability, and performance demands of these
applications suggest that these increasingly difficult network management
problems be solved in real time, across a complex web of interacting protocols
and systems. Alas, just as the importance of network management has increased,
the network has grown so complex that it is seemingly unmanageable. In this new
era, network management requires a fundamentally new approach. Instead of
optimizations based on closed-form analysis of individual protocols, network
operators need data-driven, machine-learning-based models of end-to-end and
application performance based on high-level policy goals and a holistic view of
the underlying components. Instead of anomaly detection algorithms that operate
on offline analysis of network traces, operators need classification and
detection algorithms that can make real-time, closed-loop decisions. Networks
should learn to drive themselves. This paper explores this concept, discussing
how we might attain this ambitious goal by more closely coupling measurement
with real-time control and by relying on learning for inference and prediction
about a networked application or system, as opposed to closed-form analysis of
individual protocols.
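As a toy illustration of the measure-infer-act loop the authors argue for, the sketch below pairs streaming telemetry with a trivial online model standing in for a real learned one; all names, values, and thresholds are assumptions, not anything from the paper.

```python
# Hypothetical sketch of the closed measure -> infer -> act loop, with an
# online mean/variance model (Welford's method) standing in for a real
# machine-learning model. Names and numbers are illustrative only.
class OnlineAnomalyModel:
    """Running mean/variance; scores each new point before learning it."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def score(self, x):
        if self.n < 2:
            return 0.0
        std = (self.m2 / self.n) ** 0.5
        return abs(x - self.mean) / (std + 1e-9)   # z-like deviation score

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

def control_loop(measurements, actuate, threshold=3.0):
    model = OnlineAnomalyModel()
    for latency_ms in measurements:        # streaming telemetry
        if model.score(latency_ms) > threshold:
            actuate(latency_ms)            # real-time, closed-loop decision
        model.update(latency_ms)           # keep learning online

control_loop([20, 21, 19, 22, 20, 250],   # a latency spike at the end
             lambda x: print(f"act: anomalous latency {x} ms"))
```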
DYSWIS: Collaborative Network Fault Diagnosis - Of End-users, By End-users, For End-users
As application complexity increases, so does end-users' need for network fault diagnosis; however, existing failure diagnosis techniques fail to assist end-users in accessing applications and services. We present DYSWIS, an automatic network fault detection and diagnosis system for end-users. The key idea is collaboration among end-users: a node asks multiple nodes to diagnose a network fault in real time, collecting diverse information from different parts of the network to infer the cause of failure. DYSWIS leverages a DHT network to find collaborating nodes with the network properties required to diagnose a failure. The framework allows rules and probes to be dynamically updated in a running system. Another key aspect is the contribution of expert knowledge (rules and probes) by application developers, vendors, and network administrators, thereby enabling crowdsourcing of diagnosis strategies for a growing set of applications. We have implemented the framework and software and tested them on our testbed and PlanetLab, showing that several complex, commonly occurring failures can be detected and diagnosed successfully using DYSWIS, while single-user probes with traditional tools fail to pinpoint the cause of such failures. We validate that our base modules and rules are sufficient to detect the infrastructural failures that cause the majority of application failures.
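A hedged sketch of how this collaboration might look in code: peers found through a (mocked) DHT run probes concurrently, and an expert-contributed rule interprets the combined results. The probe, rule, and peer names are hypothetical, not DYSWIS's actual API.

```python
# Hypothetical sketch of DYSWIS-style collaborative diagnosis.
from concurrent.futures import ThreadPoolExecutor

# Mock DHT lookup: network property -> peers that have it.
DHT = {"same_isp": ["peer_a"], "external": ["peer_b", "peer_c"]}

def probe_http(peer, url):
    """Stand-in for asking `peer` to fetch `url`; returns a fake result."""
    return {"peer": peer, "ok": peer != "peer_a"}

def diagnose(url):
    peers = DHT["same_isp"] + DHT["external"]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: probe_http(p, url), peers))
    # Rule (contributed by an expert): if external peers succeed but
    # same-ISP peers fail, suspect the local ISP rather than the server.
    external_ok = all(r["ok"] for r in results if r["peer"] in DHT["external"])
    local_fail = any(not r["ok"] for r in results if r["peer"] in DHT["same_isp"])
    return "suspect local ISP" if external_ok and local_fail else "suspect server"

print(diagnose("http://example.com"))   # -> suspect local ISP
```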
Automating Performance Diagnosis in Networked Systems
Diagnosing performance degradation in distributed systems is a complex and difficult task. Software that performs well in one environment may be unusably slow in another, and determining the root cause is time-consuming and error-prone, even in environments in which all the data may be available. End users have an even more difficult time trying to diagnose system performance, since both software and network problems have the same symptom: a stalled application.
The central thesis of this dissertation is that the source of performance stalls in a distributed system can be automatically detected and diagnosed with very limited information: the dependency graph of data flows through the system, and a few counters common to almost all data processing systems.
This dissertation presents FlowDiagnoser, an automated approach for diagnosing performance stalls in networked systems. FlowDiagnoser requires as little as two bits of information per module to make a diagnosis: one to indicate whether the module is actively processing data, and one to indicate whether the module is waiting on its dependents.
To support this thesis, FlowDiagnoser is implemented in two distinct environments: an individual host's networking stack, and a distributed streams processing system. In controlled experiments using real applications, FlowDiagnoser correctly diagnoses 99% of networking-related stalls due to application, connection-specific, or network-wide performance problems, with a false positive rate under 3%. The prototype system for diagnosing messaging stalls in a commercial streams processing system correctly finds 93% of message-processing stalls, with a false positive rate of 2%.
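One plausible reading of the two-bit scheme in code: starting from a stalled module, follow "waiting" edges through the dependency graph until reaching a module that is neither processing nor waiting, and blame it. The graph and status values below are invented; this is a sketch of the idea, not FlowDiagnoser itself.

```python
# Sketch of two-bits-per-module stall diagnosis over a dependency graph.
GRAPH = {"app": ["socket"], "socket": ["tcp"], "tcp": []}   # data-flow deps
STATUS = {                                                   # 2 bits / module
    "app":    {"processing": False, "waiting": True},
    "socket": {"processing": False, "waiting": True},
    "tcp":    {"processing": False, "waiting": False},       # stalled here
}

def find_stall(module):
    bits = STATUS[module]
    if bits["processing"]:
        return None                      # healthy: making progress
    if bits["waiting"]:
        for dep in GRAPH[module]:        # blame lies further along the flow
            cause = find_stall(dep)
            if cause:
                return cause
    return module                        # not processing, not waiting

print(find_stall("app"))                 # -> tcp
```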
Resilience Strategies for Network Challenge Detection, Identification and Remediation
The enormous growth of the Internet and its use in everyday life make it an attractive target for malicious users. As the network becomes more complex and sophisticated, it becomes more vulnerable to attack. There is a pressing need for the future Internet to be resilient, manageable, and secure. Our research is on distributed challenge detection and is part of the EU Resumenet project (Resilience and Survivability for Future Networking: Framework, Mechanisms and Experimental Evaluation). It aims to make networks more resilient to a wide range of challenges, including malicious attacks, misconfiguration, faults, and operational overloads. Resilience means the ability of the network to provide an acceptable level of service in the face of significant challenges; it is a superset of commonly used definitions of survivability, dependability, and fault tolerance. Our proposed resilience strategy detects a challenge by identifying its occurrence and impact in real time, then initiating appropriate remedial action. Action is taken autonomously to continue operations as far as possible, to mitigate the damage, and to allow an acceptable level of service to be maintained. The contribution of our work is the ability to mitigate a challenge as early as possible and to rapidly detect its root cause. Our proposed multi-stage, policy-based challenge detection system identifies both existing and unforeseen challenges; this has been studied and demonstrated with an unknown worm attack. The multi-stage approach reduces computational complexity compared to the traditional single-stage approach, in which one managed object is responsible for all the functions. The approach we propose in this thesis has the flexibility, scalability, adaptability, reproducibility, and extensibility needed to assist in the identification and remediation of many future network challenges.
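As an illustration of the multi-stage idea, the sketch below screens all flows with a cheap first stage and sends only suspects to a costlier second stage; the features, thresholds, and labels are invented, not the thesis's actual policies.

```python
# Hypothetical sketch of multi-stage, policy-based challenge detection.
def stage1_screen(flow):
    """Lightweight check on coarse counters (runs on every flow)."""
    return flow["pps"] > 10_000 or flow["syn_ratio"] > 0.8

def stage2_classify(flow):
    """Deeper (more expensive) analysis, run only on suspects."""
    if flow["syn_ratio"] > 0.8 and flow["unique_dsts"] > 100:
        return "worm-like scanning"
    return "overload"

def detect(flows):
    for flow in flows:
        if stage1_screen(flow):              # most flows stop at stage 1
            yield flow["id"], stage2_classify(flow)

flows = [
    {"id": 1, "pps": 500,    "syn_ratio": 0.1, "unique_dsts": 3},
    {"id": 2, "pps": 50_000, "syn_ratio": 0.9, "unique_dsts": 400},
]
print(list(detect(flows)))    # -> [(2, 'worm-like scanning')]
```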
Utilities reforms and corruption in developing countries
This paper shows empirically that "privatization" in the energy, telecommunications, and water sectors, and the introduction of independent regulators in those sectors, have not always had the expected effects on access, affordability, or quality of services. It also shows that corruption leads to adjustments in the quantity, quality, and price of services consistent with the profit-maximizing behavior one would expect from monopolies in the sector. The results suggest that privatization and the introduction of independent regulators have, at best, only partial effects on the consequences of corruption for access, affordability, and quality of utility services.
Extending Provenance For Deep Diagnosis Of Distributed Systems
Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide variety of problems can happen in distributed systems: routers can be misconfigured, nodes can be hacked, and the control software can have bugs. This is further complicated by the complexity and scale of today's distributed systems. Provenance is an attractive way to diagnose faults in distributed systems, because it can track the causality from a symptom to a set of root causes. Prior work on network provenance has successfully applied provenance to distributed systems. However, it cannot explain problems beyond the presence of faulty events and offers limited help with finding repairs.
In this dissertation, we extend provenance to handle diagnostic problems that require deeper investigation. We propose three different extensions: negative provenance explains not just the presence but also the absence of events (such as missing packets); meta provenance can suggest repairs by tracking causality not only for data but also for code (such as bugs in control plane programs); temporal provenance tracks causality at the temporal level and aims at diagnosing timing-related faults (such as slow requests). Compared to classical network provenance, our approach tracks richer causality at runtime and applies more sophisticated reasoning and post-processing. We apply these techniques to software-defined networking and the Border Gateway Protocol. Evaluations with real-world traffic and topologies show that our systems can diagnose and repair practical problems, and that the runtime overhead and query turnaround times are reasonable.
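The core provenance intuition fits in a few lines: events carry links to their causes, so a symptom can be traced back to candidate root causes. The event graph below is invented; negative, meta, and temporal provenance each extend this basic traversal in the directions the abstract describes.

```python
# Illustrative sketch of provenance-style causal tracing (invented events).
EVENTS = {
    "pkt_dropped@switch2": ["acl_rule_17@switch2"],
    "acl_rule_17@switch2": ["config_push@controller"],
    "config_push@controller": [],
}

def explain(symptom, depth=0):
    """Recursively print the causal chain behind a symptom."""
    print("  " * depth + symptom)
    for cause in EVENTS.get(symptom, []):
        explain(cause, depth + 1)

explain("pkt_dropped@switch2")
# pkt_dropped@switch2
#   acl_rule_17@switch2
#     config_push@controller
```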
Fault diagnosis for IP-based network with real-time conditions
BACKGROUND:
Fault diagnosis techniques have been based on many paradigms, which derive from diverse areas
and have different purposes: obtaining a representation model of the network for fault localization,
selecting optimal probe sets for monitoring network devices, reducing fault detection time, and
detecting faulty components in the network. Although there are several solutions for diagnosing
network faults, there are still challenges to be faced: a fault diagnosis solution needs to be always
available and able to process data in a timely manner, because stale results inhibit the quality and
speed of informed decision-making. Moreover, there is no non-invasive technique for continuously
diagnosing network symptoms without leaving the system vulnerable to failures, nor a technique
resilient to the network's dynamic changes, which can cause new failures with different symptoms.
AIMS:
This thesis aims to propose a model for the continuous and timely diagnosis of IP-based network
faults, independent of the network structure and based on data analytics techniques.
METHOD(S):
This research's point of departure was the hypothesis of a fault propagation phenomenon that
allows the observation of failure symptoms at a higher network level than the fault origin. Thus, for
the model's construction, monitoring data was collected from an extensive campus network in
which impactful link failures were induced at different instants of time and with different durations.
These data correspond to parameters widely used in the actual management of a network. The
collected data allowed us to understand the faults' behavior and how they manifest at the
peripheral level.
Based on this understanding and a data analytics process, the first three modules of our model,
named PALADIN, were proposed (Identify, Collection, and Structuring); they define peripheral data
collection and the pre-processing necessary to obtain a description of the network's state at a given
moment. These modules give the model the ability to structure the data while accounting for the
delays of the multiple responses that the network delivers to a single monitoring probe and for the
multiple network interfaces that a peripheral device may have.
Thus, a structured data stream is obtained, ready to be analyzed. For this analysis, it was
necessary to implement an incremental learning framework that respects networks' dynamic
nature. It comprises three elements: an incremental learning algorithm, a data rebalancing strategy,
and a concept-drift detector. This framework is the fourth module of the PALADIN model, named
Diagnosis.
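A minimal sketch of how the three Diagnosis elements could be wired together, assuming placeholder implementations: a trivial incremental learner, a rebalancer that over-samples rare labels in place of streaming SMOTE, and a drift detector that resets the model when recent accuracy collapses in place of ADWIN. None of this is PALADIN's actual code.

```python
class MajorityLearner:
    """Trivial incremental learner: predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def rebalance(x, y, counts):
    """Stand-in for streaming SMOTE: over-sample the rarest class."""
    is_rare = bool(counts) and y == min(counts, key=counts.get)
    return [(x, y)] * (3 if is_rare else 1)

class SimpleDriftDetector:
    """Stand-in for ADWIN: flags drift when recent accuracy collapses."""
    def __init__(self):
        self.recent = []
    def update(self, correct):
        self.recent = (self.recent + [correct])[-50:]
        return len(self.recent) == 50 and sum(self.recent) / 50 < 0.5

def diagnose_stream(stream):
    learner, detector = MajorityLearner(), SimpleDriftDetector()
    for x, y in stream:                    # x: structured symptoms, y: label
        y_hat = learner.predict(x)
        if detector.update(y_hat == y):    # drift detected: start afresh
            learner = MajorityLearner()
        for xr, yr in rebalance(x, y, learner.counts):
            learner.learn(xr, yr)          # test-then-train, rebalanced
```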
In order to evaluate the PALADIN model, the Diagnosis module was implemented with 25 different
incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming
scenario) as the rebalancing strategy. In addition, a dataset (the SOFI dataset) was built with the first
modules of the PALADIN model; these data form the incoming data stream of the Diagnosis module
used to evaluate its performance.
The PALADIN Diagnosis module performs online classification of network failures, so it is a
learning model that must be evaluated in a stream context. Prequential evaluation is the most
widely used method for this task, so we adopt it to evaluate the model's performance over
time through several stream evaluation metrics.
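Prequential (test-then-train) evaluation is simple to state in code: each arriving sample is first used to test the model and only then to train it, yielding a running accuracy curve. Below is a minimal sketch; the learner is any object with predict/learn methods, such as the stand-in above.

```python
def prequential_accuracy(stream, learner):
    """Test-then-train: score each sample before learning from it."""
    correct, seen, curve = 0, 0, []
    for x, y in stream:
        correct += (learner.predict(x) == y)   # 1) test first
        seen += 1
        curve.append(correct / seen)           # running accuracy over time
        learner.learn(x, y)                    # 2) then train
    return curve

# Usage with the MajorityLearner from the previous sketch:
# curve = prequential_accuracy(my_stream, MajorityLearner())
```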
RESULTS:
This research first evidences the phenomenon of impact fault propagation, making it possible to detect fault symptoms at a monitored network's peripheral level; this translates into non-invasive monitoring of the network. Second, the PALADIN model is the major contribution in the fault detection context because it covers two aspects: an online learning model that continuously processes the network symptoms and detects internal failures, and concept-drift detection and data-stream rebalancing components that make resilience to dynamic network changes possible. Third, it is well known that the number of available real-world datasets for imbalanced stream classification is still very small, and that number is further reduced in the networking context. The SOFI dataset, obtained with the first modules of the PALADIN model, adds to that number and encourages work on imbalanced data streams and on network fault diagnosis.
CONCLUSIONS:
The proposed model contains the necessary elements for the continuous and timely diagnosis of IP-based network faults; it introduces the idea of periodic monitoring of peripheral network elements and uses data analytics techniques to process the collected data. Based on the analysis, processing, and classification of peripherally collected data, it can be concluded that PALADIN achieves this objective. The results indicate that peripheral monitoring allows faults in the internal network to be diagnosed; in addition, the diagnosis process needs an incremental learning process, concept-drift detection elements, and a rebalancing strategy. The results of the experiments showed that PALADIN makes it possible to learn from the network's manifestations and diagnose internal network failures. The latter was verified with 25 different incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming scenario) as the rebalancing strategy. This research clearly illustrates that it is unnecessary to monitor all the internal network elements to detect a network's failures; instead, it is enough to choose the peripheral elements to be monitored. Furthermore, with proper processing of the collected status and traffic descriptors, it is possible to learn from the arriving data using incremental learning in cooperation with data rebalancing and concept-drift approaches. This proposal continuously diagnoses the network symptoms without leaving the system vulnerable to failures while remaining resilient to the network's
dynamic changes.
Model-based provisioning and management of adaptive distributed communication in mobile cooperative systems
Adaptation of communication is required to maintain reliable connections and to ensure a minimum quality of service in collaborative activities. In a wireless environment, how can host entities be handled in the event of a sudden, unexpected change in communication links and reliable sources? This challenging issue is addressed in the context of an emergency rescue system carried out by mobile devices and robots during calamities or disasters. For this kind of scenario, this book proposes an adaptive middleware to support reconfigurable, reliable group communications. The system structure is viewed at two levels: a control center with high processing power and an uninterrupted energy supply is responsible for global tasks, while local entities such as autonomous robots and firemen carrying smart devices act locally in the mission. Adaptation at the control center is handled by semantic modeling, whereas at the local entities it is managed by a software module called the communication agent (CA). Modeling follows the well-known SWRL rules, which establish the degree of importance of each communication link or component. Generic and scalable solutions for automated self-configuration are driven by rule-based reconfiguration policies. To perform dynamically in a changing environment, a trigger mechanism forces the model to take an adaptive action in order to accomplish a given task; for example, the group chosen at the beginning of a mission need not remain the same throughout the mission. Local adaptive mechanisms are handled by the CA, which manages internal service APIs to configure, set up, and monitor communication services, and manages internal resources to satisfy telecom service requirements.
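A hedged sketch of rule-driven reconfiguration in the spirit described: each rule maps an observed condition to an adaptation action, and the communication agent fires the first matching rule. The conditions, thresholds, and actions are invented; the book's policies are expressed in SWRL over a semantic model, not as Python predicates.

```python
# Hypothetical rule-based reconfiguration policy for a local entity.
RULES = [
    # (condition over observed link state, adaptation action)
    (lambda link: link["signal_dbm"] < -85, "switch_to_backup_link"),
    (lambda link: link["battery_pct"] < 15, "hand_over_role"),
    (lambda link: link["loss_rate"] > 0.3,  "reform_group"),
]

def communication_agent(link_state):
    """Local CA step: fire the first matching rule as the adaptive action."""
    for condition, action in RULES:
        if condition(link_state):
            return action
    return "no_adaptation"

print(communication_agent(
    {"signal_dbm": -90, "battery_pct": 80, "loss_rate": 0.0}))
# -> switch_to_backup_link
```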
Enabling Richer Insight Into Runtime Executions Of Systems
Very large-scale systems software is heavily used today in important scenarios such as online retail, banking, content services, web search, and social networks. As the functionality and complexity of this software grow, managing the implementations becomes a considerable challenge for developers, designers, and maintainers. Software needs to be constantly monitored and tuned for optimal efficiency and user satisfaction. At large scale, these systems incorporate significant degrees of asynchrony, parallelism, and distribution, reducing the manageability of the software, including performance management. Adding to the complexity, developers are under pressure to balance developing new functionality for customers against maintaining existing programs. This dissertation argues that the manual effort currently required to manage the performance of these systems is very high, and can be automated both to reduce the likelihood of problems and to fix them quickly once identified. The execution logs from these systems are easily available and provide rich information about runtime internals for diagnosis purposes, but the volume of logs is simply too large for today's techniques. Developers hence spend many human hours observing and investigating executions of their systems during development and diagnosis. This dissertation proposes applying machine learning techniques to automatically analyze execution logs, targeting challenging tasks in different phases of the software lifecycle. It shows that the careful application of statistical techniques to features extracted from instrumentation can distill rich log data into forms that developers can easily comprehend.
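As a small illustration of the proposed direction, the sketch below distills raw log lines into template counts and compares a suspect execution against a baseline; the templates and log lines are invented, and the dissertation's actual techniques are more sophisticated.

```python
# Hypothetical sketch: reduce log lines to templates, then compare runs.
import re
from collections import Counter

def to_template(line):
    """Mask variable parts (numbers, hex ids) so similar lines collapse."""
    return re.sub(r"0x[0-9a-f]+|\d+", "*", line)

def featurize(log_lines):
    return Counter(to_template(l) for l in log_lines)

baseline = featurize(["req 12 served in 3 ms", "req 13 served in 4 ms"])
suspect  = featurize(["req 14 served in 900 ms", "retry 0x1f after timeout"])

# Templates present only in the suspect run point the developer somewhere:
print(set(suspect) - set(baseline))   # -> {'retry * after timeout'}
```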