525 research outputs found

    Checkpointing as a Service in Heterogeneous Cloud Environments

    Get PDF
    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201

    Automated Anomaly Detection in Virtualized Services Using Deep Packet Inspection

    Get PDF
    Virtualization technologies have proven to be important drivers for the fast and cost-efficient development and deployment of services. While the benefits are tremendous, there are many challenges to be faced when developing or porting services to virtualized infrastructure. Especially critical applications like Virtualized Network Functions must meet high requirements in terms of reliability and resilience. An important tool when meeting such requirements is detecting anomalous system components and recovering the anomaly before it turns into a fault and subsequently into a failure visible to the client. Anomaly detection for virtualized services relies on collecting system metrics that represent the normal operation state of every component and allow the usage of machine learning algorithms to automatically build models representing such state. This paper presents an approach for collecting service-layer metrics while treating services as black-boxes. This allows service providers to implement anomaly detection on the application layer without the need to modify third-party software. Deep Packet Inspection is used to analyse the traffic of virtual machines on the hypervisor layer, producing both generic and protocol-specific communication metrics. An evaluation shows that the resulting metrics represent the normal operation state of an example Virtualized Network Function and are therefore a valuable contribution to automatic anomaly detection in virtualized services

    Enhancing Failure Propagation Analysis in Cloud Computing Systems

    Full text link
    In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.Comment: 12 pages, The 30th International Symposium on Software Reliability Engineering (ISSRE 2019

    Dependability of the NFV Orchestrator: State of the Art and Research Challenges

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.The introduction of network function virtualisation (NFV) represents a significant change in networking technology, which may create new opportunities in terms of cost efficiency, operations, and service provisioning. Although not explicitly stated as an objective, the dependability of the services provided using this technology should be at least as good as conventional solutions. Logical centralisation, off-the-shelf computing platforms, and increased system complexity represent new dependability challenges relative to the state of the art. The core function of the network, with respect to failure and service management, is orchestration. The failure and misoperation of the NFV orchestrator (NFVO) will have huge network-wide consequences. At the same time, NFVO is vulnerable to overload and design faults. Thus, the objective of this paper is to give a tutorial on the dependability challenges of the NFVO, and to give insight into the required future research. This paper provides necessary background information, reviews the available literature, outlines the proposed solutions, and identifies some design and research problems that must be addressed.acceptedVersio
    corecore