The New Abnormal: Network Anomalies in the AI Era
Anomaly detection aims at finding unexpected patterns in data. It has been applied to several problems in computer networks, from the detection of port scans and DDoS attacks to the monitoring of time series collected by Internet measurement systems. Data-driven approaches and machine learning have also seen widespread application in anomaly detection, a trend accelerated by recent developments in Artificial Intelligence research. This chapter summarizes recent progress in anomaly detection research. In particular, we evaluate how advances in AI algorithms open new possibilities for anomaly detection. We cover new representation learning techniques such as Generative Adversarial Networks and autoencoders, as well as techniques that can improve models learned with machine learning algorithms, such as reinforcement learning. We survey both research works and tools implementing AI algorithms for anomaly detection. We find that these novel algorithms, while successful in other fields, have hardly been applied to networking problems. We conclude the chapter with a case study that illustrates a possible research direction.
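As a concrete illustration of the autoencoder approach this abstract mentions, a minimal sketch could train on normal-only traffic features and flag samples with high reconstruction error. This is a generic instance of the technique, not the chapter's implementation; the feature matrix and dimensions below are hypothetical stand-ins.

```python
# Minimal autoencoder anomaly-detection sketch (PyTorch), assuming a
# hypothetical matrix of normal network-flow features for training.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 16), nn.ReLU(),
            nn.Linear(16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def fit(model, X_train, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_train), X_train)
        loss.backward()
        opt.step()

def anomaly_scores(model, X):
    # Per-sample reconstruction error: high error = likely anomaly,
    # since the model only learned to reconstruct normal traffic.
    with torch.no_grad():
        return ((model(X) - X) ** 2).mean(dim=1)

# Usage with stand-in data (e.g., packet/byte counts, flow duration).
X_train = torch.randn(1000, 8)
model = Autoencoder(n_features=8)
fit(model, X_train)
threshold = anomaly_scores(model, X_train).quantile(0.99)
scores = anomaly_scores(model, torch.randn(20, 8))
print((scores > threshold).nonzero().flatten())  # indices flagged anomalous
```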
Probabilistic Approach to Structural Change Prediction in Evolving Social Networks
We propose a predictive model of structural changes in elementary subgraphs of a social network based on a Mixture of Markov Chains. The model is trained and verified on a dataset from a large corporate social network analyzed in short, one-day-long time windows, and reveals distinctive patterns of evolution of connections at the level of local network topology. We argue that the network investigated at such short timescales is highly dynamic and therefore immune to classic methods of link prediction and structural analysis, and show that, in the case of complex networks, dynamic subgraph mining may lead to better prediction accuracy. The experiments were carried out on the logs of the Wroclaw University of Technology mail server.
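To make the modeling idea concrete, a rough sketch of a Mixture of Markov Chains over subgraph states might look as follows. This is written in the spirit of the abstract, not the authors' code: the state encoding (integer codes for a subgraph's configuration per daily window), the number of mixture components, and the EM details are all our assumptions.

```python
# Sketch: fit K first-order Markov chains as a mixture via EM, where
# each sequence is one subgraph's state trajectory across daily windows.
import numpy as np

def fit_mixture(sequences, n_states, K=3, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                             # mixture weights
    T = rng.dirichlet(np.ones(n_states), (K, n_states))  # K transition matrices
    for _ in range(iters):
        # E-step: responsibility of each chain for each sequence.
        logw = np.zeros((len(sequences), K))
        for i, seq in enumerate(sequences):
            for k in range(K):
                logw[i, k] = np.log(pi[k]) + sum(
                    np.log(T[k, a, b]) for a, b in zip(seq, seq[1:]))
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed transition counts.
        pi = w.mean(axis=0)
        counts = np.full((K, n_states, n_states), 1e-3)
        for i, seq in enumerate(sequences):
            for a, b in zip(seq, seq[1:]):
                counts[:, a, b] += w[i]
        T = counts / counts.sum(axis=2, keepdims=True)
    return pi, T, w

def predict_next(seq, w_i, T):
    # Next-state distribution: responsibility-weighted mixture of each
    # chain's transition row for the subgraph's current state.
    return sum(w_i[k] * T[k, seq[-1]] for k in range(len(w_i)))
```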
Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring
Maintaining the health of IT infrastructure components for improved reliability and availability has been a research and innovation topic for many years. Identifying and handling failures is crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information for diagnosing and fixing failures.
In this work, we address three essential research dimensions of failures: the need for failure handling in IT infrastructure, the contribution of system-generated logs to failure detection, and the reactive and proactive approaches used to deal with failure situations.
This study performs a comprehensive analysis of the existing literature along three prominent aspects: log preprocessing, anomaly and failure detection, and failure prevention.
With this coherent review, we (1) establish the need for IT infrastructure monitoring to avoid downtime, (2) examine the three types of approaches to anomaly and failure detection, namely rule-based, correlation-based, and classification-based methods, and (3) formulate recommendations and further research guidelines for researchers.
To the best of the authors' knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of a meta-analysis and a comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection. It will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem.
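To illustrate the log-preprocessing-plus-classification pipeline the review surveys, a simplified sketch could mask variable tokens into event templates, count events per window, and train a classifier. The log format, labels, and masking rules below are hypothetical examples, not taken from any surveyed system.

```python
# Sketch of log preprocessing + classification for failure detection,
# assuming hypothetical raw log lines and per-window failure labels.
import re
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

def to_template(line: str) -> str:
    # Crude parsing: mask IPs and numbers so similar log lines
    # collapse into one event template.
    line = re.sub(r"\b\d{1,3}(\.\d{1,3}){3}\b", "<IP>", line)
    line = re.sub(r"0x[0-9a-fA-F]+|\b\d+\b", "<NUM>", line)
    return line.strip()

def window_features(lines):
    # Feature vector: event-template counts within one time window.
    return Counter(to_template(l) for l in lines)

# Hypothetical training windows (1 = failure window, 0 = healthy).
windows = [["disk 3 read error at 0xFF10", "retry 1"],
           ["heartbeat ok from 10.0.0.2", "heartbeat ok from 10.0.0.3"]]
labels = [1, 0]

vec = DictVectorizer()
X = vec.fit_transform(window_features(w) for w in windows)
clf = RandomForestClassifier(n_estimators=50).fit(X, labels)
print(clf.predict(vec.transform(window_features(["disk 7 read error at 0xAB"]))))
```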
Leveraging Distributed Tracing and Container Cloning for Replay Debugging of Microservices
Microservice architectures have gained prominence in recent years for building large-scale industrial distributed systems. However, microservice architectures make the use of replay debugging, a powerful technique for finding the root causes of faults, very challenging because of polyglot (written in several languages) services, the large accumulated state of services, and the tight latency limits imposed by long hop chains. This work attempts to provide a framework for enabling replay debugging in production microservice applications. We study 25 real-world faults in microservice systems collected from diverse sources, categorize these faults by fault symptoms, and create 15 application-agnostic mutation operators for microservices. We then propose a language-agnostic replay debugging framework for microservice applications that uses a distributed tracing system to record network requests and enables replay of those requests on cloned service containers running in a debug environment. A key component of this framework is an anomaly detector that uses span-level and container-level monitoring to detect the fault symptoms found in our study and localizes faults to the trace level so that faulty traces can be easily replayed to find the root cause. An open-source microservices application injected successively with the mutation operators is used for an evaluation that shows that our framework is up to an order of magnitude lighter-weight than language-specific recording tools such as Chrome DevTools or VisualVM and can help find the root causes of 9 out of 15 mutations at the line or function level.
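A minimal sketch of the span-level anomaly detection idea described above: build per-operation latency baselines from recorded spans, then flag any trace containing a span beyond mean plus three standard deviations as a replay candidate. The span fields below are hypothetical and do not follow any specific tracing system's schema.

```python
# Sketch: per-operation latency baselines over spans, flagging traces
# with outlier spans as candidates for replay in the debug environment.
import statistics
from collections import defaultdict

def build_baselines(spans):
    # spans: iterable of dicts like
    # {"trace_id": ..., "operation": ..., "duration_ms": ...}
    by_op = defaultdict(list)
    for s in spans:
        by_op[s["operation"]].append(s["duration_ms"])
    return {op: (statistics.mean(d), statistics.pstdev(d))
            for op, d in by_op.items()}

def anomalous_traces(spans, baselines, k=3.0):
    flagged = set()
    for s in spans:
        mean, std = baselines[s["operation"]]
        if s["duration_ms"] > mean + k * std:
            flagged.add(s["trace_id"])  # replay this trace to find the root cause
    return flagged
```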
PerfCE: Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis
Debugging performance anomalies in real-world databases is challenging. Causal inference techniques enable qualitative and quantitative root cause analysis of performance degradation. Nevertheless, causality analysis is practically challenging, particularly due to limited observability. Recently, chaos engineering has been applied to test complex real-world software systems. Chaos frameworks like Chaos Mesh mutate a set of chaos variables to inject catastrophic events (e.g., network slowdowns) that "stress" software systems. The systems under chaos stress are then tested using methods like differential testing to check whether they retain their normal functionality (e.g., that SQL query output is always correct under stress). Despite its ubiquity in industry, chaos engineering is currently employed mostly to aid software testing rather than performance debugging.
This paper identifies a novel use of chaos engineering to help developers diagnose performance anomalies in databases. Our framework, PERFCE, comprises an offline phase and an online phase. The offline phase learns statistical models of the target database system, whilst the online phase diagnoses the root cause of monitored performance anomalies on the fly. During the offline phase, PERFCE leverages both passive observations and proactive chaos experiments to construct accurate causal graphs and structural equation models (SEMs). When performance anomalies are observed during the online phase, causal graphs enable qualitative root cause identification (e.g., high CPU usage) and SEMs enable quantitative counterfactual analysis (e.g., determining that "when CPU usage is reduced to 45%, performance returns to normal"). PERFCE notably outperforms prior works on common synthetic datasets, and our evaluation on real-world databases, MySQL and TiDB, shows that PERFCE is highly accurate and moderately expensive.
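To make the SEM-based counterfactual step tangible, an illustrative linear SEM in the style the abstract describes (not PERFCE's actual models) could fit latency as a function of its causal parents, abduct the unit's noise term, and then intervene on CPU usage. The variable names, coefficients, and linearity assumption below are ours.

```python
# Sketch: linear SEM counterfactual, "if CPU usage were reduced to 45%,
# would latency return to normal?" All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
cpu = rng.uniform(20, 95, 500)             # CPU usage (%) across chaos experiments
io = rng.uniform(0, 100, 500)              # disk I/O pressure
latency = 2.0 * cpu + 0.5 * io + rng.normal(0, 5, 500)  # query latency (ms)

# Fit the structural equation latency = w1*cpu + w2*io + c by least squares.
X = np.column_stack([cpu, io, np.ones_like(cpu)])
w, *_ = np.linalg.lstsq(X, latency, rcond=None)

def counterfactual_latency(cpu_now, io_now, latency_now, cpu_do):
    # Abduction: recover this unit's noise term from the observed anomaly.
    noise = latency_now - (w[0] * cpu_now + w[1] * io_now + w[2])
    # Intervention: do(cpu = cpu_do), holding the noise and other parents fixed.
    return w[0] * cpu_do + w[1] * io_now + w[2] + noise

print(counterfactual_latency(cpu_now=90, io_now=60, latency_now=215, cpu_do=45))
```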