1,328 research outputs found
Practical issues for the implementation of survivability and recovery techniques in optical networks
TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems
Root Cause Analysis (RCA) is becoming increasingly crucial for ensuring the
reliability of microservice systems. However, performing RCA on modern
microservice systems can be challenging due to their large scale, as they
usually comprise hundreds of components, leading significant human effort. This
paper proposes TraceDiag, an end-to-end RCA framework that addresses the
challenges for large-scale microservice systems. It leverages reinforcement
learning to learn a pruning policy for the service dependency graph to
automatically eliminates redundant components, thereby significantly improving
the RCA efficiency. The learned pruning policy is interpretable and fully
adaptive to new RCA instances. With the pruned graph, a causal-based method can
be executed with high accuracy and efficiency. The proposed TraceDiag framework
is evaluated on real data traces collected from the Microsoft Exchange system,
and demonstrates superior performance compared to state-of-the-art RCA
approaches. Notably, TraceDiag has been integrated as a critical component in
the Microsoft M365 Exchange, resulting in a significant improvement in the
system's reliability and a considerable reduction in the human effort required
for RCA
AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
Artificial Intelligence for IT operations (AIOps) aims to combine the power
of AI with the big data generated by IT Operations processes, particularly in
cloud infrastructures, to provide actionable insights with the primary goal of
maximizing availability. There are a wide variety of problems to address, and
multiple use-cases, where AI capabilities can be leveraged to enhance
operational efficiency. Here we provide a review of the AIOps vision, trends
challenges and opportunities, specifically focusing on the underlying AI
techniques. We discuss in depth the key types of data emitted by IT Operations
activities, the scale and challenges in analyzing them, and where they can be
helpful. We categorize the key AIOps tasks as - incident detection, failure
prediction, root cause analysis and automated actions. We discuss the problem
formulation for each task, and then present a taxonomy of techniques to solve
these problems. We also identify relatively under explored topics, especially
those that could significantly benefit from advances in AI literature. We also
provide insights into the trends in this field, and what are the key investment
opportunities
Active self-diagnosis in telecommunication networks
Les réseaux de télécommunications deviennent de plus en plus complexes, notamment de par la multiplicité des technologies mises en œuvre, leur couverture géographique grandissante, la croissance du trafic en quantité et en variété, mais aussi de par l évolution des services fournis par les opérateurs. Tout ceci contribue à rendre la gestion de ces réseaux de plus en plus lourde, complexe, génératrice d erreurs et donc coûteuse pour les opérateurs. On place derrière le terme réseaux autonome l ensemble des solutions visant à rendre la gestion de ce réseau plus autonome. L objectif de cette thèse est de contribuer à la réalisation de certaines fonctions autonomiques dans les réseaux de télécommunications. Nous proposons une stratégie pour automatiser la gestion des pannes tout en couvrant les différents segments du réseau et les services de bout en bout déployés au-dessus. Il s agit d une approche basée modèle qui adresse les deux difficultés du diagnostic basé modèle à savoir : a) la façon d'obtenir un tel modèle, adapté à un réseau donné à un moment donné, en particulier si l'on souhaite capturer plusieurs couches réseau et segments et b) comment raisonner sur un modèle potentiellement énorme, si l'on veut gérer un réseau national par exemple. Pour répondre à la première difficulté, nous proposons un nouveau concept : l auto-modélisation qui consiste d abord à construire les différentes familles de modèles génériques, puis à identifier à la volée les instances de ces modèles qui sont déployées dans le réseau géré. La seconde difficulté est adressée grâce à un moteur d auto-diagnostic actif, basé sur le formalisme des réseaux Bayésiens et qui consiste à raisonner sur un fragment du modèle du réseau qui est augmenté progressivement en utilisant la capacité d auto-modélisation: des observations sont collectées et des tests réalisés jusqu à ce que les fautes soient localisées avec une certitude suffisante. Cette approche de diagnostic actif a été expérimentée pour réaliser une gestion multi-couches et multi-segments des alarmes dans un réseau IMS.While modern networks and services are continuously growing in scale, complexity and heterogeneity, the management of such systems is reaching the limits of human capabilities. Technically and economically, more automation of the classical management tasks is needed. This has triggered a significant research effort, gathered under the terms self-management and autonomic networking. The aim of this thesis is to contribute to the realization of some self-management properties in telecommunication networks. We propose an approach to automatize the management of faults, covering the different segments of a network, and the end-to-end services deployed over them. This is a model-based approach addressing the two weaknesses of model-based diagnosis namely: a) how to derive such a model, suited to a given network at a given time, in particular if one wishes to capture several network layers and segments and b) how to reason a potentially huge model, if one wishes to manage a nation-wide network for example. To address the first point, we propose a new concept called self-modeling that formulates off-line generic patterns of the model, and identifies on-line the instances of these patterns that are deployed in the managed network. The second point is addressed by an active self-diagnosis engine, based on a Bayesian network formalism, that consists in reasoning on a progressively growing fragment of the network model, relying on the self-modeling ability: more observations are collected and new tests are performed until the faults are localized with sufficient confidence. This active diagnosis approach has been experimented to perform cross-layer and cross-segment alarm management on an IMS network.RENNES1-Bibl. électronique (352382106) / SudocSudocFranceF
A Machine Learning Enhanced Scheme for Intelligent Network Management
The versatile networking services bring about huge influence on daily living styles while the amount and diversity of services cause high complexity of network systems. The network scale and complexity grow with the increasing infrastructure apparatuses, networking function, networking slices, and underlying architecture evolution. The conventional way is manual administration to maintain the large and complex platform, which makes effective and insightful management troublesome. A feasible and promising scheme is to extract insightful information from largely produced network data. The goal of this thesis is to use learning-based algorithms inspired by machine learning communities to discover valuable knowledge from substantial network data, which directly promotes intelligent management and maintenance. In the thesis, the management and maintenance focus on two schemes: network anomalies detection and root causes localization; critical traffic resource control and optimization. Firstly, the abundant network data wrap up informative messages but its heterogeneity and perplexity make diagnosis challenging. For unstructured logs, abstract and formatted log templates are extracted to regulate log records. An in-depth analysis framework based on heterogeneous data is proposed in order to detect the occurrence of faults and anomalies. It employs representation learning methods to map unstructured data into numerical features, and fuses the extracted feature for network anomaly and fault detection. The representation learning makes use of word2vec-based embedding technologies for semantic expression. Next, the fault and anomaly detection solely unveils the occurrence of events while failing to figure out the root causes for useful administration so that the fault localization opens a gate to narrow down the source of systematic anomalies. The extracted features are formed as the anomaly degree coupled with an importance ranking method to highlight the locations of anomalies in network systems. Two types of ranking modes are instantiated by PageRank and operation errors for jointly highlighting latent issue of locations. Besides the fault and anomaly detection, network traffic engineering deals with network communication and computation resource to optimize data traffic transferring efficiency. Especially when network traffic are constrained with communication conditions, a pro-active path planning scheme is helpful for efficient traffic controlling actions. Then a learning-based traffic planning algorithm is proposed based on sequence-to-sequence model to discover hidden reasonable paths from abundant traffic history data over the Software Defined Network architecture. Finally, traffic engineering merely based on empirical data is likely to result in stale and sub-optimal solutions, even ending up with worse situations. A resilient mechanism is required to adapt network flows based on context into a dynamic environment. Thus, a reinforcement learning-based scheme is put forward for dynamic data forwarding considering network resource status, which explicitly presents a promising performance improvement. In the end, the proposed anomaly processing framework strengthens the analysis and diagnosis for network system administrators through synthesized fault detection and root cause localization. The learning-based traffic engineering stimulates networking flow management via experienced data and further shows a promising direction of flexible traffic adjustment for ever-changing environments
AI Solutions for MDS: Artificial Intelligence Techniques for Misuse Detection and Localisation in Telecommunication Environments
This report considers the application of Articial Intelligence (AI) techniques to
the problem of misuse detection and misuse localisation within telecommunications
environments. A broad survey of techniques is provided, that covers inter alia
rule based systems, model-based systems, case based reasoning, pattern matching,
clustering and feature extraction, articial neural networks, genetic algorithms, arti
cial immune systems, agent based systems, data mining and a variety of hybrid
approaches. The report then considers the central issue of event correlation, that
is at the heart of many misuse detection and localisation systems. The notion of
being able to infer misuse by the correlation of individual temporally distributed
events within a multiple data stream environment is explored, and a range of techniques,
covering model based approaches, `programmed' AI and machine learning
paradigms. It is found that, in general, correlation is best achieved via rule based approaches,
but that these suffer from a number of drawbacks, such as the difculty of
developing and maintaining an appropriate knowledge base, and the lack of ability
to generalise from known misuses to new unseen misuses. Two distinct approaches
are evident. One attempts to encode knowledge of known misuses, typically within
rules, and use this to screen events. This approach cannot generally detect misuses
for which it has not been programmed, i.e. it is prone to issuing false negatives.
The other attempts to `learn' the features of event patterns that constitute normal
behaviour, and, by observing patterns that do not match expected behaviour, detect
when a misuse has occurred. This approach is prone to issuing false positives,
i.e. inferring misuse from innocent patterns of behaviour that the system was not
trained to recognise. Contemporary approaches are seen to favour hybridisation,
often combining detection or localisation mechanisms for both abnormal and normal
behaviour, the former to capture known cases of misuse, the latter to capture
unknown cases. In some systems, these mechanisms even work together to update
each other to increase detection rates and lower false positive rates. It is concluded
that hybridisation offers the most promising future direction, but that a rule or state
based component is likely to remain, being the most natural approach to the correlation
of complex events. The challenge, then, is to mitigate the weaknesses of
canonical programmed systems such that learning, generalisation and adaptation
are more readily facilitated
- …