Search CORE

25 research outputs found

An Adaptive Resilience Testing Framework for Microservice Systems

Author: Lee Cheryl
Lyu Michael R.
Shen Jiacheng
Su Yuxin
Yang Tianyi
Yang Yongqiang
Publication venue
Publication date: 24/12/2022
Field of study

Resilience testing, which measures the ability to minimize service degradation caused by unexpected failures, is crucial for microservice systems. The current practice for resilience testing relies on manually defining rules for different microservice systems. Due to the diverse business logic of microservices, there are no one-size-fits-all microservice resilience testing rules. As the quantity and dynamic of microservices and failures largely increase, manual configuration exhibits its scalability and adaptivity issues. To overcome the two issues, we empirically compare the impacts of common failures in the resilient and unresilient deployments of a benchmark microservice system. Our study demonstrates that the resilient deployment can block the propagation of degradation from system performance metrics (e.g., memory usage) to business metrics (e.g., response latency). In this paper, we propose AVERT, the first AdaptiVE Resilience Testing framework for microservice systems. AVERT first injects failures into microservices and collects available monitoring metrics. Then AVERT ranks all the monitoring metrics according to their contributions to the overall service degradation caused by the injected failures. Lastly, AVERT produces a resilience index by how much the degradation in system performance metrics propagates to the degradation in business metrics. The higher the degradation propagation, the lower the resilience of the microservice system. We evaluate AVERT on two open-source benchmark microservice systems. The experimental results show that AVERT can accurately and efficiently test the resilience of microservice systems

arXiv.org e-Print Archive

Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems

Author: Chen Zhuangbin
Feng Cong
Gu Jiazhen
Huang Junjie
Jiang Zhihan
Liu Jinyang
Lyu Michael R.
Yang Yongqiang
Yang Zengyin
Publication venue
Publication date: 15/08/2023
Field of study

Ensuring the reliability of cloud systems is critical for both cloud vendors and customers. Cloud systems often rely on virtualization techniques to create instances of hardware resources, such as virtual machines. However, virtualization hinders the observability of cloud systems, making it challenging to diagnose platform-level issues. To improve system observability, we propose to infer functional clusters of instances, i.e., groups of instances having similar functionalities. We first conduct a pilot study on a large-scale cloud system, i.e., Huawei Cloud, demonstrating that instances having similar functionalities share similar communication and resource usage patterns. Motivated by these findings, we formulate the identification of functional clusters as a clustering problem and propose a non-intrusive solution called Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions instances into coarse-grained chunks based on communication patterns. Within each chunk, Prism further groups instances with similar resource usage patterns to produce fine-grained functional clusters. Such a design reduces noises in the data and allows Prism to process massive instances efficiently. We evaluate Prism on two datasets collected from the real-world production environment of Huawei Cloud. Our experiments show that Prism achieves a v-measure of ~0.95, surpassing existing state-of-the-art solutions. Additionally, we illustrate the integration of Prism within monitoring systems for enhanced cloud reliability through two real-world use cases.Comment: The paper was accepted by the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023

arXiv.org e-Print Archive

Understanding Persistent-Memory Related Issues in the Linux Kernel

Author: Gatla Om Rameshwar
Xu Wei
Zhang Duo
Zheng Mai
Publication venue
Publication date: 09/07/2023
Field of study

Persistent memory (PM) technologies have inspired a wide range of PM-based system optimizations. However, building correct PM-based systems is difficult due to the unique characteristics of PM hardware. To better understand the challenges as well as the opportunities to address them, this paper presents a comprehensive study of PM-related issues in the Linux kernel. By analyzing 1,553 PM-related kernel patches in-depth and conducting experiments on reproducibility and tool extension, we derive multiple insights in terms of PM patch categories, PM bug patterns, consequences, fix strategies, triggering conditions, and remedy solutions. We hope our results could contribute to the development of robust PM-based storage systemsComment: ACM TRANSACTIONS ON STORAGE(TOS'23

arXiv.org e-Print Archive