Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201
Automated Anomaly Detection in Virtualized Services Using Deep Packet Inspection
Virtualization technologies have proven to be important drivers for the fast and cost-efficient development and deployment of services. While the benefits are tremendous, there are many challenges to be faced when developing or porting services to virtualized infrastructure. Critical applications in particular, such as Virtualized Network Functions, must meet high requirements in terms of reliability and resilience. An important tool for meeting such requirements is detecting anomalous system components and recovering from the anomaly before it turns into a fault and subsequently into a failure visible to the client. Anomaly detection for virtualized services relies on collecting system metrics that represent the normal operation state of every component and allow machine learning algorithms to automatically build models representing that state. This paper presents an approach for collecting service-layer metrics while treating services as black boxes. This allows service providers to implement anomaly detection on the application layer without the need to modify third-party software. Deep Packet Inspection is used to analyse the traffic of virtual machines on the hypervisor layer, producing both generic and protocol-specific communication metrics. An evaluation shows that the resulting metrics represent the normal operation state of an example Virtualized Network Function and are therefore a valuable contribution to automatic anomaly detection in virtualized services.
Enhancing Failure Propagation Analysis in Cloud Computing Systems
In order to plan for failure recovery, the designers of cloud systems need to
understand how their system can potentially fail. Unfortunately, analyzing the
failure behavior of such systems can be very difficult and time-consuming, due
to the large volume of events, non-determinism, and reuse of third-party
components. To address these issues, we propose a novel approach that joins
fault injection with anomaly detection to identify the symptoms of failures. We
evaluated the proposed approach in the context of the OpenStack cloud computing
platform. We show that our model can significantly improve the accuracy of
failure analysis in terms of false positives and negatives, with a low
computational cost.
Comment: 12 pages, The 30th International Symposium on Software Reliability
Engineering (ISSRE 2019)
Dependability of the NFV Orchestrator: State of the Art and Research Challenges
© 2018 IEEE. The introduction of network function virtualisation (NFV) represents a significant change in networking technology, which may create new opportunities in terms of cost efficiency, operations, and service provisioning. Although not explicitly stated as an objective, the dependability of the services provided using this technology should be at least as good as conventional solutions. Logical centralisation, off-the-shelf computing platforms, and increased system complexity represent new dependability challenges relative to the state of the art. The core function of the network, with respect to failure and service management, is orchestration. The failure and misoperation of the NFV orchestrator (NFVO) will have huge network-wide consequences. At the same time, NFVO is vulnerable to overload and design faults. Thus, the objective of this paper is to give a tutorial on the dependability challenges of the NFVO, and to give insight into the required future research. This paper provides necessary background information, reviews the available literature, outlines the proposed solutions, and identifies some design and research problems that must be addressed.