An Adaptive Resilience Testing Framework for Microservice Systems
Resilience testing, which measures the ability to minimize service
degradation caused by unexpected failures, is crucial for microservice systems.
The current practice for resilience testing relies on manually defining rules
for different microservice systems. Due to the diverse business logic of
microservices, there are no one-size-fits-all microservice resilience testing
rules. As the number and dynamism of microservices and failures grow,
manual configuration runs into scalability and adaptivity issues.
To overcome the two issues, we empirically compare the impacts of common
failures in the resilient and unresilient deployments of a benchmark
microservice system. Our study demonstrates that the resilient deployment can
block the propagation of degradation from system performance metrics (e.g.,
memory usage) to business metrics (e.g., response latency). In this paper, we
propose AVERT, the first AdaptiVE Resilience Testing framework for microservice
systems. AVERT first injects failures into microservices and collects available
monitoring metrics. Then AVERT ranks all the monitoring metrics according to
their contributions to the overall service degradation caused by the injected
failures. Lastly, AVERT produces a resilience index based on how much the
degradation in system performance metrics propagates to the degradation in
business metrics. The higher the degradation propagation, the lower the resilience of
the microservice system. We evaluate AVERT on two open-source benchmark
microservice systems. The experimental results show that AVERT can accurately
and efficiently test the resilience of microservice systems.
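The final step described above, turning degradation propagation into a resilience index, can be illustrated with a minimal sketch. This is not AVERT's actual formula, which the abstract does not give. It assumes that degradation is the relative worsening of a metric between a baseline run and a failure-injected run, and that propagation is the ratio of mean business-metric degradation to mean system-metric degradation, capped at 1. The function names and sample numbers are hypothetical.

```python
from statistics import mean


def degradation(baseline, faulty):
    """Relative worsening of a metric where larger values are worse (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (faulty - baseline) / baseline)


def resilience_index(system_metrics, business_metrics):
    """Each argument maps a metric name to a (baseline, faulty) pair.

    Returns a value in [0, 1]; higher means less degradation propagated
    from system performance metrics to business metrics.
    """
    sys_deg = mean(degradation(b, f) for b, f in system_metrics.values())
    biz_deg = mean(degradation(b, f) for b, f in business_metrics.values())
    if sys_deg == 0:
        return 1.0  # nothing degraded at the system level, so nothing propagated
    propagation = min(1.0, biz_deg / sys_deg)
    return 1.0 - propagation


# Hypothetical measurements: memory usage doubles under the injected failure.
system = {"memory_usage_mb": (500.0, 1000.0)}
resilient = {"latency_ms": (100.0, 110.0)}    # latency barely moves
unresilient = {"latency_ms": (100.0, 200.0)}  # latency doubles as well

resilient_score = resilience_index(system, resilient)
unresilient_score = resilience_index(system, unresilient)
```

Under these assumptions the resilient deployment scores close to 1 and the unresilient one scores 0, matching the abstract's statement that higher propagation means lower resilience.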
Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems
Ensuring the reliability of cloud systems is critical for both cloud vendors
and customers. Cloud systems often rely on virtualization techniques to create
instances of hardware resources, such as virtual machines. However,
virtualization hinders the observability of cloud systems, making it
challenging to diagnose platform-level issues. To improve system observability,
we propose to infer functional clusters of instances, i.e., groups of instances
having similar functionalities. We first conduct a pilot study on a large-scale
cloud system, i.e., Huawei Cloud, demonstrating that instances having similar
functionalities share similar communication and resource usage patterns.
Motivated by these findings, we formulate the identification of functional
clusters as a clustering problem and propose a non-intrusive solution called
Prism. Prism adopts a coarse-to-fine clustering strategy. It first partitions
instances into coarse-grained chunks based on communication patterns. Within
each chunk, Prism further groups instances with similar resource usage patterns
to produce fine-grained functional clusters. Such a design reduces noise in
the data and allows Prism to process massive instances efficiently. We evaluate
Prism on two datasets collected from the real-world production environment of
Huawei Cloud. Our experiments show that Prism achieves a v-measure of ~0.95,
surpassing existing state-of-the-art solutions. Additionally, we illustrate the
integration of Prism within monitoring systems for enhanced cloud reliability
through two real-world use cases.
Comment: The paper was accepted by the 38th IEEE/ACM International Conference
on Automated Software Engineering (ASE 2023).
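The coarse-to-fine strategy described in the Prism abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not Prism's algorithm. The coarse stage here simply groups instances that share an identical communication signature (a hypothetical set of contacted ports). The fine stage uses a greedy distance threshold on resource usage vectors. Instance names, signatures, and the threshold value are all invented for the example.

```python
from collections import defaultdict
from math import dist


def coarse_chunks(instances):
    """Stage 1 (assumed): group instance names by identical communication signature."""
    chunks = defaultdict(list)
    for name, (signature, _usage) in instances.items():
        chunks[signature].append(name)
    return list(chunks.values())


def fine_clusters(chunk, usage, threshold):
    """Stage 2 (assumed): greedy clustering on resource usage within one chunk.

    An instance joins the first cluster whose representative (its first
    member) lies within `threshold` Euclidean distance; otherwise it
    starts a new cluster.
    """
    clusters = []
    for name in chunk:
        for cluster in clusters:
            if dist(usage[name], usage[cluster[0]]) <= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def coarse_to_fine(instances, threshold=0.2):
    """Run the coarse stage, then refine each chunk independently."""
    usage = {name: u for name, (_sig, u) in instances.items()}
    result = []
    for chunk in coarse_chunks(instances):
        result.extend(fine_clusters(chunk, usage, threshold))
    return result


# Hypothetical instances: (communication signature, (cpu, memory) utilization).
instances = {
    "vm-a": (frozenset({443}), (0.80, 0.60)),
    "vm-b": (frozenset({443}), (0.82, 0.58)),
    "vm-c": (frozenset({443}), (0.10, 0.20)),
    "vm-d": (frozenset({3306}), (0.50, 0.90)),
}
clusters = coarse_to_fine(instances)
```

Because the fine stage only compares instances inside the same chunk, the number of pairwise comparisons stays small, which is in the spirit of the efficiency argument the abstract makes for this design.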
Understanding Persistent-Memory Related Issues in the Linux Kernel
Persistent memory (PM) technologies have inspired a wide range of PM-based
system optimizations. However, building correct PM-based systems is difficult
due to the unique characteristics of PM hardware. To better understand the
challenges as well as the opportunities to address them, this paper presents a
comprehensive study of PM-related issues in the Linux kernel. By analyzing
1,553 PM-related kernel patches in-depth and conducting experiments on
reproducibility and tool extension, we derive multiple insights in terms of PM
patch categories, PM bug patterns, consequences, fix strategies, triggering
conditions, and remedy solutions. We hope our results could contribute to the
development of robust PM-based storage systems.
Comment: ACM TRANSACTIONS ON STORAGE (TOS'23)