Automata for Web Services Fault Monitoring and Diagnosis
Like any software, web services must go through the phases of the fault management lifecycle. Model-based diagnosis is a well-established practice with several advantages, including being cognitively easier for development and testing teams to understand. An automaton is a simple, formally well-defined model used for monitoring and diagnosing system faults. For this reason, we review work on automata for web service fault management and also propose a stochastic automaton model for this purpose.
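To make the idea concrete, here is a minimal sketch of an automaton-based monitor for a web service's event stream; the states, events, and transition probabilities are invented for illustration and are not the model proposed in the paper.

```python
# Hypothetical sketch: a stochastic automaton that monitors a web service's
# event stream and flags low-probability transitions as fault symptoms.

# Transition model: (state, event) -> (next_state, probability).
# States, events, and probabilities here are invented for illustration.
TRANSITIONS = {
    ("idle", "request"):        ("processing", 0.97),
    ("processing", "response"): ("idle", 0.95),
    ("processing", "timeout"):  ("faulty", 0.03),
    ("faulty", "restart"):      ("idle", 0.90),
}

def monitor(events, threshold=0.05):
    """Replay observed events; report transitions whose modeled
    probability falls below `threshold` (likely fault symptoms)."""
    state = "idle"
    for event in events:
        next_state, prob = TRANSITIONS.get((state, event), ("unknown", 0.0))
        if prob < threshold:
            yield (state, event, next_state, prob)  # suspicious transition
        state = next_state

# Example: a timeout during processing is reported as a symptom.
for symptom in monitor(["request", "timeout", "restart"]):
    print("suspicious:", symptom)
```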
Why (and How) Networks Should Run Themselves
The proliferation of networked devices, systems, and applications that we
depend on every day makes managing networks more important than ever. The
increasing security, availability, and performance demands of these
applications suggest that these increasingly difficult network management
problems be solved in real time, across a complex web of interacting protocols
and systems. Alas, just as the importance of network management has increased,
the network has grown so complex that it is seemingly unmanageable. In this new
era, network management requires a fundamentally new approach. Instead of
optimizations based on closed-form analysis of individual protocols, network
operators need data-driven, machine-learning-based models of end-to-end and
application performance based on high-level policy goals and a holistic view of
the underlying components. Instead of anomaly detection algorithms that operate
on offline analysis of network traces, operators need classification and
detection algorithms that can make real-time, closed-loop decisions. Networks
should learn to drive themselves. This paper explores this concept, discussing
how we might attain this ambitious goal by more closely coupling measurement
with real-time control and by relying on learning for inference and prediction
about a networked application or system, as opposed to closed-form analysis of
individual protocols.
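As a toy illustration of the measure-infer-act loop the authors argue for, the sketch below pairs streaming telemetry with a trivial online model standing in for a real learned one; all names, values, and thresholds are assumptions, not anything from the paper.

```python
# Hypothetical sketch of the closed measure -> infer -> act loop, with an
# online mean/variance model (Welford's method) standing in for a real
# machine-learning model. Names and numbers are illustrative only.
class OnlineAnomalyModel:
    """Running mean/variance; scores each new point before learning it."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def score(self, x):
        if self.n < 2:
            return 0.0
        std = (self.m2 / self.n) ** 0.5
        return abs(x - self.mean) / (std + 1e-9)   # z-like deviation score

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

def control_loop(measurements, actuate, threshold=3.0):
    model = OnlineAnomalyModel()
    for latency_ms in measurements:        # streaming telemetry
        if model.score(latency_ms) > threshold:
            actuate(latency_ms)            # real-time, closed-loop decision
        model.update(latency_ms)           # keep learning online

control_loop([20, 21, 19, 22, 20, 250],   # a latency spike at the end
             lambda x: print(f"act: anomalous latency {x} ms"))
```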
DYSWIS: Collaborative Network Fault Diagnosis - Of End-users, By End-users, For End-users
As application complexity increases, so does end-users' need for network fault diagnosis; however, existing failure diagnosis techniques fail to assist end-users in accessing applications and services. We present DYSWIS, an automatic network fault detection and diagnosis system for end-users. The key idea is collaboration among end-users: a node asks multiple nodes to diagnose a network fault in real time, collecting diverse information from different parts of the network to infer the cause of failure. DYSWIS leverages a DHT network to find collaborating nodes with the network properties required to diagnose a failure. The framework allows rules and probes to be dynamically updated in a running system. Another key aspect is the contribution of expert knowledge (rules and probes) by application developers, vendors, and network administrators, thereby enabling crowdsourcing of diagnosis strategies for a growing set of applications. We have implemented the framework and software and tested them on our testbed and PlanetLab, showing that several complex, commonly occurring failures can be detected and diagnosed successfully using DYSWIS, while single-user probes with traditional tools fail to pinpoint the cause of such failures. We validate that our base modules and rules are sufficient to detect the infrastructural failures that cause the majority of application failures.
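A hedged sketch of how this collaboration might look in code: peers found through a (mocked) DHT run probes concurrently, and an expert-contributed rule interprets the combined results. The probe, rule, and peer names are hypothetical, not DYSWIS's actual API.

```python
# Hypothetical sketch of DYSWIS-style collaborative diagnosis.
from concurrent.futures import ThreadPoolExecutor

# Mock DHT lookup: network property -> peers that have it.
DHT = {"same_isp": ["peer_a"], "external": ["peer_b", "peer_c"]}

def probe_http(peer, url):
    """Stand-in for asking `peer` to fetch `url`; returns a fake result."""
    return {"peer": peer, "ok": peer != "peer_a"}

def diagnose(url):
    peers = DHT["same_isp"] + DHT["external"]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: probe_http(p, url), peers))
    # Rule (contributed by an expert): if external peers succeed but
    # same-ISP peers fail, suspect the local ISP rather than the server.
    external_ok = all(r["ok"] for r in results if r["peer"] in DHT["external"])
    local_fail = any(not r["ok"] for r in results if r["peer"] in DHT["same_isp"])
    return "suspect local ISP" if external_ok and local_fail else "suspect server"

print(diagnose("http://example.com"))   # -> suspect local ISP
```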
Automating Performance Diagnosis in Networked Systems
Diagnosing performance degradation in distributed systems is a complex and difficult task. Software that performs well in one environment may be unusably slow in another, and determining the root cause is time-consuming and error-prone, even in environments in which all the data may be available. End users have an even more difficult time trying to diagnose system performance, since both software and network problems have the same symptom: a stalled application.
The central thesis of this dissertation is that the source of performance stalls in a distributed system can be automatically detected and diagnosed with very limited information: the dependency graph of data flows through the system, and a few counters common to almost all data processing systems.
This dissertation presents FlowDiagnoser, an automated approach for diagnosing performance stalls in networked systems. FlowDiagnoser requires as little as two bits of information per module to make a diagnosis: one to indicate whether the module is actively processing data, and one to indicate whether the module is waiting on its dependents.
To support this thesis, FlowDiagnoser is implemented in two distinct environments: an individual host's networking stack, and a distributed streams processing system. In controlled experiments using real applications, FlowDiagnoser correctly diagnoses 99% of networking-related stalls due to application, connection-specific, or network-wide performance problems, with a false positive rate under 3%. The prototype system for diagnosing messaging stalls in a commercial streams processing system correctly finds 93% of message-processing stalls, with a false positive rate of 2%.
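One plausible reading of the two-bit scheme in code: starting from a stalled module, follow "waiting" edges through the dependency graph until reaching a module that is neither processing nor waiting, and blame it. The graph and status values below are invented; this is a sketch of the idea, not FlowDiagnoser itself.

```python
# Sketch of two-bits-per-module stall diagnosis over a dependency graph.
GRAPH = {"app": ["socket"], "socket": ["tcp"], "tcp": []}   # data-flow deps
STATUS = {                                                   # 2 bits / module
    "app":    {"processing": False, "waiting": True},
    "socket": {"processing": False, "waiting": True},
    "tcp":    {"processing": False, "waiting": False},       # stalled here
}

def find_stall(module):
    bits = STATUS[module]
    if bits["processing"]:
        return None                      # healthy: making progress
    if bits["waiting"]:
        for dep in GRAPH[module]:        # blame lies further along the flow
            cause = find_stall(dep)
            if cause:
                return cause
    return module                        # not processing, not waiting

print(find_stall("app"))                 # -> tcp
```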
Resilience Strategies for Network Challenge Detection, Identification and Remediation
The enormous growth of the Internet and its use in everyday life make it an attractive target for malicious users. As the network becomes more complex and sophisticated, it becomes more vulnerable to attack. There is a pressing need for the future Internet to be resilient, manageable, and secure. Our research is on distributed challenge detection and is part of the EU Resumenet project (Resilience and Survivability for Future Networking: Framework, Mechanisms and Experimental Evaluation). It aims to make networks more resilient to a wide range of challenges, including malicious attacks, misconfiguration, faults, and operational overloads. Resilience means the ability of the network to provide an acceptable level of service in the face of significant challenges; it is a superset of commonly used definitions of survivability, dependability, and fault tolerance. Our proposed resilience strategy detects a challenge by identifying its occurrence and impact in real time, then initiating appropriate remedial action. Action is taken autonomously to continue operations as far as possible, to mitigate the damage, and to allow an acceptable level of service to be maintained. The contribution of our work is the ability to mitigate a challenge as early as possible and to rapidly detect its root cause. Our proposed multi-stage, policy-based challenge detection system identifies both existing and unforeseen challenges; this has been studied and demonstrated with an unknown worm attack. The multi-stage approach reduces computational complexity compared to the traditional single-stage approach, in which one managed object is responsible for all the functions. The approach we propose in this thesis has the flexibility, scalability, adaptability, reproducibility, and extensibility needed to assist in the identification and remediation of many future network challenges.
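As an illustration of the multi-stage idea, the sketch below screens all flows with a cheap first stage and sends only suspects to a costlier second stage; the features, thresholds, and labels are invented, not the thesis's actual policies.

```python
# Hypothetical sketch of multi-stage, policy-based challenge detection.
def stage1_screen(flow):
    """Lightweight check on coarse counters (runs on every flow)."""
    return flow["pps"] > 10_000 or flow["syn_ratio"] > 0.8

def stage2_classify(flow):
    """Deeper (more expensive) analysis, run only on suspects."""
    if flow["syn_ratio"] > 0.8 and flow["unique_dsts"] > 100:
        return "worm-like scanning"
    return "overload"

def detect(flows):
    for flow in flows:
        if stage1_screen(flow):              # most flows stop at stage 1
            yield flow["id"], stage2_classify(flow)

flows = [
    {"id": 1, "pps": 500,    "syn_ratio": 0.1, "unique_dsts": 3},
    {"id": 2, "pps": 50_000, "syn_ratio": 0.9, "unique_dsts": 400},
]
print(list(detect(flows)))    # -> [(2, 'worm-like scanning')]
```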
Utilities reforms and corruption in developing countries
This paper shows empirically that "privatization" in the energy, telecommunications, and water sectors, and the introduction of independent regulators in those sectors, have not always had the expected effects on access, affordability, or quality of services. It also shows that corruption leads to adjustments in the quantity, quality, and price of services consistent with the profit-maximizing behavior one would expect from monopolies in the sector. The results suggest that privatization and the introduction of independent regulators have, at best, only partial effects on the consequences of corruption for access, affordability, and quality of utility services.
Extending Provenance For Deep Diagnosis Of Distributed Systems
Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide variety of problems can happen in distributed systems: routers can be misconfigured, nodes can be hacked, and the control software can have bugs. This is further complicated by the complexity and scale of today's distributed systems. Provenance is an attractive way to diagnose faults in distributed systems, because it can track the causality from a symptom to a set of root causes. Prior work on network provenance has successfully applied provenance to distributed systems. However, it cannot explain problems beyond the presence of faulty events and offers limited help with finding repairs.
In this dissertation, we extend provenance to handle diagnostic problems that require deeper investigation. We propose three different extensions: negative provenance explains not just the presence but also the absence of events (such as missing packets); meta provenance can suggest repairs by tracking causality not only for data but also for code (such as bugs in control plane programs); temporal provenance tracks causality at the temporal level and aims at diagnosing timing-related faults (such as slow requests). Compared to classical network provenance, our approach tracks richer causality at runtime and applies more sophisticated reasoning and post-processing. We apply these techniques to software-defined networking and the Border Gateway Protocol. Evaluations with real-world traffic and topologies show that our systems can diagnose and repair practical problems, and that the runtime overhead and query turnaround times are reasonable.
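The core provenance intuition fits in a few lines: events carry links to their causes, so a symptom can be traced back to candidate root causes. The event graph below is invented; negative, meta, and temporal provenance each extend this basic traversal in the directions the abstract describes.

```python
# Illustrative sketch of provenance-style causal tracing (invented events).
EVENTS = {
    "pkt_dropped@switch2": ["acl_rule_17@switch2"],
    "acl_rule_17@switch2": ["config_push@controller"],
    "config_push@controller": [],
}

def explain(symptom, depth=0):
    """Recursively print the causal chain behind a symptom."""
    print("  " * depth + symptom)
    for cause in EVENTS.get(symptom, []):
        explain(cause, depth + 1)

explain("pkt_dropped@switch2")
# pkt_dropped@switch2
#   acl_rule_17@switch2
#     config_push@controller
```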
Fault diagnosis for IP-based network with real-time conditions
BACKGROUND:
Fault diagnosis techniques have been based on many paradigms, which derive from diverse areas
and have different purposes: obtaining a representation model of the network for fault localization,
selecting optimal probe sets for monitoring network devices, reducing fault detection time, and
detecting faulty components in the network. Although there are several solutions for diagnosing
network faults, there are still challenges to be faced: a fault diagnosis solution needs to be always
available and able to process data in a timely manner, because stale results inhibit the quality and
speed of informed decision-making. Moreover, there is no non-invasive technique for continuously
diagnosing network symptoms without leaving the system vulnerable to failures, nor a technique
resilient to the network's dynamic changes, which can cause new failures with different symptoms.
AIMS:
This thesis aims to propose a model for the continuous and timely diagnosis of IP-based network
faults, independent of the network structure and based on data analytics techniques.
METHOD(S):
This research's point of departure was the hypothesis of a fault propagation phenomenon that
allows the observation of failure symptoms at a higher network level than the fault origin. Thus, for
the model's construction, monitoring data was collected from an extensive campus network in
which impactful link failures were induced at different instants of time and with different durations.
These data correspond to parameters widely used in the actual management of a network. The
collected data allowed us to understand the faults' behavior and how they manifest at the
peripheral level.
Based on this understanding and a data analytics process, the first three modules of our model,
named PALADIN, were proposed (Identify, Collection, and Structuring); they define peripheral data
collection and the pre-processing necessary to obtain a description of the network's state at a given
moment. These modules give the model the ability to structure the data while accounting for the
delays of the multiple responses that the network delivers to a single monitoring probe and for the
multiple network interfaces that a peripheral device may have.
Thus, a structured data stream is obtained, ready to be analyzed. For this analysis, it was
necessary to implement an incremental learning framework that respects networks' dynamic
nature. It comprises three elements: an incremental learning algorithm, a data rebalancing strategy,
and a concept-drift detector. This framework is the fourth module of the PALADIN model, named
Diagnosis.
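A minimal sketch of how the three Diagnosis elements could be wired together, assuming placeholder implementations: a trivial incremental learner, a rebalancer that over-samples rare labels in place of streaming SMOTE, and a drift detector that resets the model when recent accuracy collapses in place of ADWIN. None of this is PALADIN's actual code.

```python
class MajorityLearner:
    """Trivial incremental learner: predicts the most frequent label seen."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def rebalance(x, y, counts):
    """Stand-in for streaming SMOTE: over-sample the rarest class."""
    is_rare = bool(counts) and y == min(counts, key=counts.get)
    return [(x, y)] * (3 if is_rare else 1)

class SimpleDriftDetector:
    """Stand-in for ADWIN: flags drift when recent accuracy collapses."""
    def __init__(self):
        self.recent = []
    def update(self, correct):
        self.recent = (self.recent + [correct])[-50:]
        return len(self.recent) == 50 and sum(self.recent) / 50 < 0.5

def diagnose_stream(stream):
    learner, detector = MajorityLearner(), SimpleDriftDetector()
    for x, y in stream:                    # x: structured symptoms, y: label
        y_hat = learner.predict(x)
        if detector.update(y_hat == y):    # drift detected: start afresh
            learner = MajorityLearner()
        for xr, yr in rebalance(x, y, learner.counts):
            learner.learn(xr, yr)          # test-then-train, rebalanced
```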
In order to evaluate the PALADIN model, the Diagnosis module was implemented with 25 different
incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming
scenario) as the rebalancing strategy. In addition, a dataset (the SOFI dataset) was built with the first
modules of the PALADIN model; these data form the incoming data stream of the Diagnosis module
used to evaluate its performance.
The PALADIN Diagnosis module performs online classification of network failures, so it is a
learning model that must be evaluated in a stream context. Prequential evaluation is the most
widely used method for this task, so we adopt it to evaluate the model's performance over
time through several stream evaluation metrics.
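Prequential (test-then-train) evaluation is simple to state in code: each arriving sample is first used to test the model and only then to train it, yielding a running accuracy curve. Below is a minimal sketch; the learner is any object with predict/learn methods, such as the stand-in above.

```python
def prequential_accuracy(stream, learner):
    """Test-then-train: score each sample before learning from it."""
    correct, seen, curve = 0, 0, []
    for x, y in stream:
        correct += (learner.predict(x) == y)   # 1) test first
        seen += 1
        curve.append(correct / seen)           # running accuracy over time
        learner.learn(x, y)                    # 2) then train
    return curve

# Usage with the MajorityLearner from the previous sketch:
# curve = prequential_accuracy(my_stream, MajorityLearner())
```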
RESULTS:
This research first evidences the phenomenon of impact fault propagation, making it possible to detect fault symptoms at a monitored network's peripheral level; this translates into non-invasive monitoring of the network. Second, the PALADIN model is the major contribution in the fault detection context because it covers two aspects: an online learning model that continuously processes the network symptoms and detects internal failures, and concept-drift detection and data-stream rebalancing components that make resilience to dynamic network changes possible. Third, it is well known that the number of available real-world datasets for imbalanced stream classification is still very small, and that number is further reduced in the networking context. The SOFI dataset, obtained with the first modules of the PALADIN model, adds to that number and encourages work on imbalanced data streams and on network fault diagnosis.
CONCLUSIONS:
The proposed model contains the necessary elements for the continuous and timely diagnosis of IP-based network faults; it introduces the idea of periodic monitoring of peripheral network elements and uses data analytics techniques to process the collected data. Based on the analysis, processing, and classification of peripherally collected data, it can be concluded that PALADIN achieves this objective. The results indicate that peripheral monitoring allows faults in the internal network to be diagnosed; in addition, the diagnosis process needs an incremental learning process, concept-drift detection elements, and a rebalancing strategy. The results of the experiments showed that PALADIN makes it possible to learn from the network's manifestations and diagnose internal network failures. The latter was verified with 25 different incremental algorithms, ADWIN as the concept-drift detector, and SMOTE (adapted to the streaming scenario) as the rebalancing strategy. This research clearly illustrates that it is unnecessary to monitor all the internal network elements to detect a network's failures; instead, it is enough to choose the peripheral elements to be monitored. Furthermore, with proper processing of the collected status and traffic descriptors, it is possible to learn from the arriving data using incremental learning in cooperation with data rebalancing and concept-drift approaches. This proposal continuously diagnoses the network symptoms without leaving the system vulnerable to failures while remaining resilient to the network's
dynamic changes.
Model-based provisioning and management of adaptive distributed communication in mobile cooperative systems
Adaptation of communication is required to maintain reliable connections and to ensure a minimum quality of service in collaborative activities. In a wireless environment, how can host entities be handled in the event of a sudden, unexpected change in communication links and reliable sources? This challenging issue is addressed in the context of an emergency rescue system carried out by mobile devices and robots during calamities or disasters. For this kind of scenario, this book proposes an adaptive middleware to support reconfigurable, reliable group communications. The system structure is viewed at two levels: a control center with high processing power and an uninterrupted energy supply is responsible for global tasks, while local entities such as autonomous robots and firemen carrying smart devices act locally in the mission. Adaptation at the control center is handled by semantic modeling, whereas at the local entities it is managed by a software module called the communication agent (CA). Modeling follows the well-known SWRL rules, which establish the degree of importance of each communication link or component. Generic and scalable solutions for automated self-configuration are driven by rule-based reconfiguration policies. To perform dynamically in a changing environment, a trigger mechanism forces the model to take an adaptive action in order to accomplish a given task; for example, the group chosen at the beginning of a mission need not remain the same throughout the mission. Local adaptive mechanisms are handled by the CA, which manages internal service APIs to configure, set up, and monitor communication services, and manages internal resources to satisfy telecom service requirements.
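A hedged sketch of rule-driven reconfiguration in the spirit described: each rule maps an observed condition to an adaptation action, and the communication agent fires the first matching rule. The conditions, thresholds, and actions are invented; the book's policies are expressed in SWRL over a semantic model, not as Python predicates.

```python
# Hypothetical rule-based reconfiguration policy for a local entity.
RULES = [
    # (condition over observed link state, adaptation action)
    (lambda link: link["signal_dbm"] < -85, "switch_to_backup_link"),
    (lambda link: link["battery_pct"] < 15, "hand_over_role"),
    (lambda link: link["loss_rate"] > 0.3,  "reform_group"),
]

def communication_agent(link_state):
    """Local CA step: fire the first matching rule as the adaptive action."""
    for condition, action in RULES:
        if condition(link_state):
            return action
    return "no_adaptation"

print(communication_agent(
    {"signal_dbm": -90, "battery_pct": 80, "loss_rate": 0.0}))
# -> switch_to_backup_link
```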
Enabling Richer Insight Into Runtime Executions Of Systems
Very large-scale systems software is heavily used today in important scenarios such as online retail, banking, content services, web search, and social networks. As the functionality and complexity of this software grow, managing the implementations becomes a considerable challenge for developers, designers, and maintainers. Software needs to be constantly monitored and tuned for optimal efficiency and user satisfaction. At large scale, these systems incorporate significant degrees of asynchrony, parallelism, and distribution, reducing the manageability of the software, including performance management. Adding to the complexity, developers are under pressure to balance developing new functionality for customers against maintaining existing programs. This dissertation argues that the manual effort currently required to manage the performance of these systems is very high, and can be automated both to reduce the likelihood of problems and to fix them quickly once identified. The execution logs from these systems are easily available and provide rich information about runtime internals for diagnosis purposes, but the volume of logs is simply too large for today's techniques. Developers hence spend many human hours observing and investigating executions of their systems during development and diagnosis. This dissertation proposes applying machine learning techniques to automatically analyze execution logs, targeting challenging tasks in different phases of the software lifecycle. It shows that the careful application of statistical techniques to features extracted from instrumentation can distill rich log data into forms that developers can easily comprehend.
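As a small illustration of the proposed direction, the sketch below distills raw log lines into template counts and compares a suspect execution against a baseline; the templates and log lines are invented, and the dissertation's actual techniques are more sophisticated.

```python
# Hypothetical sketch: reduce log lines to templates, then compare runs.
import re
from collections import Counter

def to_template(line):
    """Mask variable parts (numbers, hex ids) so similar lines collapse."""
    return re.sub(r"0x[0-9a-f]+|\d+", "*", line)

def featurize(log_lines):
    return Counter(to_template(l) for l in log_lines)

baseline = featurize(["req 12 served in 3 ms", "req 13 served in 4 ms"])
suspect  = featurize(["req 14 served in 900 ms", "retry 0x1f after timeout"])

# Templates present only in the suspect run point the developer somewhere:
print(set(suspect) - set(baseline))   # -> {'retry * after timeout'}
```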