47 research outputs found
Configuration Validation with Large Language Models
Misconfigurations are major causes of software failures. Existing practices
rely on developer-written rules or test cases to validate configurations, which
are expensive. Machine learning (ML) for configuration validation is considered
a promising direction, but has been facing challenges such as the need of
large-scale field data and system-specific models. Recent advances in Large
Language Models (LLMs) show promise in addressing some of the long-lasting
limitations of ML-based configuration validation. We present a first analysis
on the feasibility and effectiveness of using LLMs for configuration
validation. We empirically evaluate LLMs as configuration validators by
developing a generic LLM-based configuration validation framework, named Ciri.
Ciri employs effective prompt engineering with few-shot learning based on both
valid configuration and misconfiguration data. Ciri checks outputs from LLMs
when producing results, addressing hallucination and nondeterminism of LLMs. We
evaluate Ciri's validation effectiveness on eight popular LLMs using
configuration data of ten widely deployed open-source systems. Our analysis (1)
confirms the potential of using LLMs for configuration validation, (2) explores
design space of LLMbased validators like Ciri, and (3) reveals open challenges
such as ineffectiveness in detecting certain types of misconfigurations and
biases towards popular configuration parameters
Automated Implementation of Windows-related Security-Configuration Guides
Hardening is the process of configuring IT systems to ensure the security of
the systems' components and data they process or store. The complexity of
contemporary IT infrastructures, however, renders manual security hardening and
maintenance a daunting task.
In many organizations, security-configuration guides expressed in the SCAP
(Security Content Automation Protocol) are used as a basis for hardening, but
these guides by themselves provide no means for automatically implementing the
required configurations.
In this paper, we propose an approach to automatically extract the relevant
information from publicly available security-configuration guides for Windows
operating systems using natural language processing. In a second step, the
extracted information is verified using the information of available settings
stored in the Windows Administrative Template files, in which the majority of
Windows configuration settings is defined.
We show that our implementation of this approach can extract and implement
83% of the rules without any manual effort and 96% with minimal manual effort.
Furthermore, we conduct a study with 12 state-of-the-art guides consisting of
2014 rules with automatic checks and show that our tooling can implement at
least 97% of them correctly. We have thus significantly reduced the effort of
securing systems based on existing security-configuration guides
Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
Ensuring the reliability and availability of cloud services necessitates
efficient root cause analysis (RCA) for cloud incidents. Traditional RCA
methods, which rely on manual investigations of data sources such as logs and
traces, are often laborious, error-prone, and challenging for on-call
engineers. In this paper, we introduce RCACopilot, an innovative on-call system
empowered by the large language model for automating RCA of cloud incidents.
RCACopilot matches incoming incidents to corresponding incident handlers based
on their alert types, aggregates the critical runtime diagnostic information,
predicts the incident's root cause category, and provides an explanatory
narrative. We evaluate RCACopilot using a real-world dataset consisting of a
year's worth of incidents from Microsoft. Our evaluation demonstrates that
RCACopilot achieves RCA accuracy up to 0.766. Furthermore, the diagnostic
information collection component of RCACopilot has been successfully in use at
Microsoft for over four years
Hardening Cloud and Datacenter Systems against Misconfigurations: Principles and Tool Support
Misconfigurations (a.k.a., configuration errors from a system’s standpoint) are among the dominant causes of today’s catastrophic system failures that turn down cloud-scale services and affect hundreds of millions of end users. Despite their wide adoption, traditional fault-tolerance and failure-recovery techniques are not effective in dealing with configuration errors, especially in large-scale software systems deployed in cloud and datacenters. To make the matters worse, even the tolerance and recovery mechanisms themselves are often misconfigured in the real world, which impairs the immune system of the entire cloud and datacenters.This dissertation explores two fundamental questions towards the solutions for the inevitable misconfigurations—how to build reliable cloud and datacenter systems in the face of configuration errors; moreover, how to prevent misconfigurations in the first place by better configuration design. The goal is to enable software systems to proactively anticipate and defend against misconfigurations, rather than reacting to their manifestations and consequences. This dissertation presents three key principles of systems design and implementation for hardening cloud and datacenter systems against misconfigurations—anticipating misconfigurations, early detection of configuration errors, and simplicity-oriented configuration design. The dissertation demonstrates that applying these principles can effectively defend cloud and datacenter systems against misconfigurations. Moreover, the dissertation presents the corresponding techniques and tool support that can automatically and systematically apply these principles to existing systems software. The main technical insight is that configurations are essentially used by the systems, while configuration errors are mostly manifested through the faulty execution that uses erroneous configuration values. Therefore, by analyzing the system’s code that usesconfiguration values, one can understand and make use of system-level information of configurations to build defense against potential errors. This dissertation first presents Spex that enables systems to anticipate misconfigurations. Spex automatically infers configuration constraints from a system’s source code, and then leverages the constraints to test the system’s resilience to misconfigurations and detect error-prone configuration design/handling. On step further, the dissertation introduces PCheck to automatically generate checking code which captures configuration errors at the system’s initialization phase to prevent their late manifestations and the corresponding failure damage.Going beyond, this dissertation presents simplicity-oriented configuration design towards more usable and less error-prone software configuration. The key idea is to apply the user-centric philosophy to design configuration as an interface—configurations are essentially the interface for controlling and customizing system behavior, but have rarely been treated as it is. The dissertation shows that configurations in today’s systems software can be significantly simplified and effectively navigated, with the understanding of how they are actually used in the field