148 research outputs found
A Simple Approach to Detect Anomalies in Microservices-Based Systems Using PyOD
Ease of scale is one of the defining characteristics of microservices. However, with scalability comes a diversity of services, making it important to detect anomalies as early as possible. Because the architectural style is recent, there are still few studies on the best approaches to detecting anomalies in microservices. This paper proposes the Python toolkit PyOD as an approach to microservice anomaly detection. The toolkit comprises a set of anomaly detection algorithms, ranging from the classical LOF (SIGMOD 2000) to the recent ECOD (TKDE 2022). To evaluate the approach, we used two of its algorithms, k-Nearest Neighbors (kNN) and Histogram-based Outlier Score (HBOS), to detect anomalies such as application bugs, CPU exhaustion, and network jams on the TraceRCA dataset, which contains logs from a real microservices system. The preliminary results show that the HBOS algorithm performs better than kNN, with a Recall and F1-Score of 93% and 89%, respectively, while for kNN these metrics were 92% and 85%, respectively.
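The abstract does not include code, but PyOD's kNN and HBOS detectors follow a common fit/predict interface. The following is a minimal sketch; the synthetic features stand in for trace-derived metrics, since the paper's feature extraction is not described here.

# Minimal PyOD sketch: fit kNN and HBOS detectors on numeric features.
# The synthetic arrays below are stand-ins for TraceRCA-derived features.
import numpy as np
from pyod.models.knn import KNN
from pyod.models.hbos import HBOS

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, size=(1000, 8))            # mostly normal behaviour
X_test = np.vstack([rng.normal(0, 1, size=(95, 8)),
                    rng.normal(5, 1, size=(5, 8))])   # a few injected anomalies

for name, model in [("kNN", KNN()), ("HBOS", HBOS())]:
    model.fit(X_train)                        # unsupervised fit on training data
    labels = model.predict(X_test)            # 0 = inlier, 1 = outlier
    scores = model.decision_function(X_test)  # raw outlier scores
    print(name, "flagged", labels.sum(), "of", len(labels), "points")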
Real-Time QoS Monitoring and Anomaly Detection on Microservice-based Applications in Cloud-Edge Infrastructure
Ph.D. Thesis. Microservices have emerged as a new approach for developing and deploying cloud applications that require higher levels of agility, scale, and reliability. A microservice-based cloud application architecture advocates decomposition of monolithic application components into independent software components called "microservices". As the independent microservices can be developed, deployed, and updated independently of each other, this leads to complex run-time performance monitoring and management challenges. The deployment environment for microservices in multi-cloud environments is very complex, as there are numerous components running in heterogeneous environments (VM/container) and communicating frequently with each other using REST-based/REST-less APIs. In some cases, multiple components can also be executed inside a single VM/container, making failure or anomaly detection very complicated. It is necessary to monitor the performance variation of all the service components to detect any reason for failure.
The microservice and container architecture allows designing loosely coupled services and running them in a lightweight runtime environment for more efficient scaling. Thus, container-based microservice deployment is now the standard model for hosting cloud applications across industries. Despite the strong scalability of this model, which opens the door to further optimizations in both application structure and performance, this characteristic adds an additional level of complexity to monitoring application performance. A performance monitoring system can lead to severe application outages if it is not able to quickly and reliably detect failures and localize their causes. Machine learning-based techniques have been applied to detect anomalies in microservice-based cloud applications. Existing research works have used different tracking algorithms to search for the root cause of observed anomalous behaviour. However, linking the observed failures of an application with their root causes by the use of these techniques is still an open research problem.
Osmotic computing is a new IoT application programming paradigm driven by the significant increase in resource capacity/capability at the network edge, along with support for data transfer protocols that enable such resources to interact more seamlessly with cloud-based services. Much of the difficulty in Quality of Service (QoS) and performance monitoring of IoT applications in an osmotic computing environment is due to the massive scale and heterogeneity (IoT + edge + cloud) of computing environments.
To handle monitoring and anomaly detection of microservices in cloud and edge datacenters, this thesis presents multilateral research on monitoring and anomaly detection for microservice-based application performance in cloud-edge infrastructure. The key contributions of this thesis are as follows:
• It introduces a novel system, Multi-microservices Multi-virtualization Multi-cloud monitoring (M3), that provides a holistic approach to monitor the performance of microservice-based application stacks deployed across multiple cloud data centers.
• A framework for a Monitoring, Anomaly Detection and Localization System (MADLS), which utilizes a simplified approach that depends on commonly available metrics, offering a simplified deployment environment for the developer (see the sketch below).
• A unified monitoring model for cloud-edge that provides an IoT application administrator with detailed QoS information related to microservices deployed across cloud and edge datacenters.
Royal Embassy of Saudi Arabia Cultural Bureau in London, government of Saudi Arabia
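MADLS itself is not spelled out in the abstract, so the following is only an illustrative sketch of the general idea of flagging anomalies from commonly available metrics (here, CPU utilisation) with a rolling z-score; the window and threshold values are assumptions.

# Illustrative rolling z-score detector over a commonly available metric.
import numpy as np

def zscore_anomalies(metric, window=30, threshold=3.0):
    """Indices where the metric deviates strongly from its recent rolling mean."""
    x = np.asarray(metric, dtype=float)
    hits = []
    for i in range(window, len(x)):
        recent = x[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        if sigma > 0 and abs(x[i] - mu) / sigma > threshold:
            hits.append(i)
    return hits

# Example: noisy but steady CPU utilisation with one injected spike.
rng = np.random.default_rng(0)
cpu = (20 + rng.normal(0, 1, 200)).tolist()
cpu[120] += 40  # simulated anomaly
print(zscore_anomalies(cpu))  # flags index 120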
Model-based resource management for fine-grained services
The emergence of DevOps has changed the way modern distributed software systems are developed. Architectures decomposed into fine-grained services, such as microservices or function-as-a-service (FaaS), are now widespread across many organizations. From a resource management perspective, although the systems built with such architectures have many benefits, there are still research challenges that need further attention. In this study, we have focused on three such challenges, each concerning a specific system resource: compute, memory, or storage. Firstly, we focus on scaling the capacity of microservices at runtime. Here, the challenge is to design an autoscaler that can decide between vertical and horizontal scaling options to distribute CPU capacity. Secondly, we focus on estimating the required capacity of an on-premises FaaS platform such that the service level agreements (SLAs) for function response times are satisfied. The challenge here is to address the cold start dilemma, i.e., that a cold start delays a function response but reduces memory consumption. Thus, we must find a limit on cold starts such that memory consumption remains in check while satisfying the SLAs. Finally, we focus on storage management for distributed tracing targeted at microservices. The volume of such traces generated in a data center can be on the scale of tens of terabytes per day, but only a small fraction of these traces is useful for troubleshooting. The objective then is to sample only the useful traces. The key to addressing all these challenges is first modeling the dynamics concerning the resources and subsequently leveraging the model in a resource controller. To address the first challenge, we have developed an autoscaler, ATOM, that leverages layered queueing network (LQN) models to make its scaling decisions. Our experiment with a real-life application shows that ATOM produces 30-37% better results than the baseline autoscalers. For the second challenge, we have developed COCOA, a cold start aware capacity planner. COCOA utilizes M/M/k setup and LQN models to assess the cold start scenario and estimate the required capacity. We show through simulation that COCOA can reduce over-provisioning by over 70% compared to availability-aware approaches. Finally, addressing the third challenge, we propose SampleHST, a trace sampler that works under a storage budget constraint. SampleHST relies on either bag-of-words or graph-based models to represent a trace and groups similar traces using online clustering to perform sampling. We have evaluated the performance of SampleHST using data from both the literature and production, which shows it produces 1.2x to 19x better results than the state-of-the-art.
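COCOA's own models are more elaborate (M/M/k setup plus LQN), but the capacity-sizing idea can be illustrated with the textbook M/M/k Erlang-C formula: find the smallest number of warm instances k such that the mean queueing delay stays under an SLA target. This sketch ignores cold starts entirely and the example numbers are assumptions.

# Textbook M/M/k (Erlang-C) capacity sizing, ignoring cold starts.
import math

def erlang_c(k, a):
    """Probability an arriving request must wait (a = lambda/mu, offered load)."""
    top = (a ** k / math.factorial(k)) * (k / (k - a))
    bottom = sum(a ** i / math.factorial(i) for i in range(k)) + top
    return top / bottom

def required_instances(lam, mu, sla_wait):
    """Smallest k with mean wait W_q = C(k, a) / (k*mu - lam) <= sla_wait."""
    a = lam / mu
    k = int(a) + 1  # need k > a for a stable queue
    while erlang_c(k, a) / (k * mu - lam) > sla_wait:
        k += 1
    return k

# Example: 50 req/s, ~100 ms service time (mu = 10/s), target mean wait 10 ms.
print(required_instances(lam=50.0, mu=10.0, sla_wait=0.010))  # -> 8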
Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World
This report documents the program and the outcomes of GI-Dagstuhl Seminar
16394 "Software Performance Engineering in the DevOps World".
The seminar addressed the problem of performance-aware DevOps. Both DevOps and performance engineering have been growing trends over the past one to two years, in no small part due to the rising importance of identifying performance anomalies in the operations (Ops) of cloud and big data systems and feeding these back to the development (Dev). However, so far, the research community has treated software engineering, performance engineering, and cloud computing mostly as individual research areas. We aimed to identify opportunities for cross-community collaboration and to set the path for long-lasting collaborations towards performance-aware DevOps.
The main goal of the seminar was to bring together young researchers (PhD
students in a later stage of their PhD, as well as PostDocs or Junior
Professors) in the areas of (i) software engineering, (ii) performance
engineering, and (iii) cloud computing and big data to present their current
research projects, to exchange experience and expertise, to discuss research
challenges, and to develop ideas for future collaborations.
A container orchestration development that optimizes the Etherpad collaborative editing tool through a novel management system
The use of collaborative tools has notably increased recently. It is common to see different users who need to work simultaneously on shared documents. In most cases, large companies provide tools whose implementation has been a complicated and expensive task. Likewise, their deployment typically requires robust hardware infrastructure, which becomes even more critical when the main target is scalability and high availability. Therefore, this study aims to design and implement a microservices-based collaborative architecture using assembled containers in the cloud, enabling the deployment of Etherpad instances with guaranteed high availability. To ensure such a task, we developed and optimized a central management system that creates Etherpad instances and continuously interacts with other Etherpad tools running on Docker containers. This design moves from monolithic Etherpad instantiation and handling towards a service architecture, where every Etherpad is offered as a microservice. Furthermore, the management system implements the popular Observer, Factory Method, Proxy, and Service Layer design patterns. This allows users to gain more privacy through access validations and shared resources. Our results indicate both the correct operation of automated container creation for new users who register in the system and a quantifiable performance improvement.
The funding of this research is provided by the Mobility Regulation of the Universidad de las Fuerzas Armadas ESPE, from Sangolquí, Ecuador
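As a hypothetical sketch of the Factory Method idea behind such a management system, each registered user could get a dedicated Etherpad container via the Docker SDK for Python. The image name, port mapping, and naming scheme below are assumptions, not taken from the paper.

# Hypothetical Factory Method sketch: one Etherpad container per user.
import docker

class EtherpadFactory:
    def __init__(self, image="etherpad/etherpad"):   # assumed image name
        self.client = docker.from_env()
        self.image = image

    def create_instance(self, user_id: str):
        """Start a dedicated Etherpad container for one registered user."""
        return self.client.containers.run(
            self.image,
            name=f"etherpad-{user_id}",
            detach=True,
            ports={"9001/tcp": None},  # let Docker pick a free host port
        )

factory = EtherpadFactory()
pad = factory.create_instance("alice")
print(pad.name, pad.status)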
Artificial intelligence driven anomaly detection for big data systems
The main goal of this thesis is to contribute to research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially Big Data platforms within cloud computing environments. Late detection and manual resolution of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. New, precise, and efficient performance management methods are therefore key to handling performance anomalies and interference impacts and to improving the efficiency of data center resources.
The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning (ML) algorithms, as well as on four different monitoring datasets. The results show that our proposed method outperforms the other ML methods, typically achieving 98-99% F-scores. Moreover, we show that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology.
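The thesis's network architecture is not given in this abstract; as an illustrative stand-in, a small multilayer perceptron over operating-system monitoring metrics can separate normal from anomalous windows. The features and labels below are synthetic assumptions.

# Illustrative MLP anomaly classifier over synthetic OS monitoring metrics.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
normal = rng.normal(0.3, 0.05, size=(900, 6))    # e.g. CPU, memory, I/O, net
anomalous = rng.normal(0.8, 0.10, size=(100, 6)) # shifted metric profile
X = np.vstack([normal, anomalous])
y = np.array([0] * 900 + [1] * 100)              # 0 = normal, 1 = anomaly

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))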
The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters needed to train the anomaly detection model to high accuracy. The objective is to accelerate the search for the training dataset size, optimize the neural network configuration, and improve the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system demonstrates that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments by up to 75% compared with naïve anomaly detection training.
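A rough sketch of this BO loop, under stated assumptions: scikit-optimize's Gaussian-process minimizer searches over a training-set size and a network width within a fixed experiment budget. The train_and_score function is a toy placeholder for the expensive training run, not the thesis's objective.

# BO over training-set size and network width with scikit-optimize.
from skopt import gp_minimize
from skopt.space import Integer

def train_and_score(params):
    n_samples, hidden_units = params
    # Placeholder for: train the detector on n_samples traces with
    # hidden_units neurons; return negative F1 so the minimiser maximises F1.
    return -(0.9 - 0.3 / n_samples**0.5 - abs(hidden_units - 32) / 500)

result = gp_minimize(
    train_and_score,
    dimensions=[Integer(100, 5000, name="n_samples"),
                Integer(8, 128, name="hidden_units")],
    n_calls=20,          # fixed budget of experiments
    random_state=0,
)
print("best (n_samples, hidden_units):", result.x, "best F1:", -result.fun)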
The last contribution overcomes the challenges of predicting the completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within the system. Our interference detection model can estimate and help alleviate the task slowdown caused by interference. This model assists system operators in making accurate decisions to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish completion time predictions for batch jobs within cloud environments. We compare our model with three baseline models (a queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4,500 experiments using the DaCapo benchmarking suite confirms the predictive efficiency and capability of the proposed model, achieving up to 10% MAPE compared with the other models.
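To make the shape of such a learned predictor concrete, here is a minimal sketch, not the thesis model: a small neural-network regressor predicting completion time from CPU run-queue size and a job-size feature. All data and the contention relationship are synthetic assumptions.

# Minimal completion-time regressor over synthetic run-queue measurements.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
run_queue = rng.integers(1, 16, size=2000)      # active threads in the system
job_size = rng.uniform(1.0, 10.0, size=2000)    # nominal work units
X = np.column_stack([run_queue, job_size])
# Toy ground truth: completion time grows with work and with contention.
y = job_size * (1.0 + 0.2 * run_queue) + rng.normal(0, 0.3, size=2000)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=1)
model.fit(X[:1500], y[:1500])
pred = model.predict(X[1500:])
mape = np.mean(np.abs((y[1500:] - pred) / y[1500:])) * 100
print(f"MAPE: {mape:.1f}%")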