8 research outputs found

    Enhancing Failure Propagation Analysis in Cloud Computing Systems

    Full text link
    In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that combines fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost. Comment: 12 pages; The 30th International Symposium on Software Reliability Engineering (ISSRE 2019).
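The abstract above pairs fault injection with anomaly detection over large event volumes. As a minimal illustrative sketch (not the paper's actual model), one can learn per-run event frequencies from fault-free executions and flag events in a fault-injection run that are unseen or far more frequent than the baseline; all trace data here is invented.

```python
# Sketch: frequency-based anomaly detection over event traces.
# Baseline is learned from fault-free runs; deviations are failure symptoms.
from collections import Counter

def train_baseline(normal_traces):
    """Average how often each event type occurs per fault-free run."""
    counts = Counter()
    for trace in normal_traces:
        counts.update(trace)
    runs = len(normal_traces)
    return {event: n / runs for event, n in counts.items()}

def detect_anomalies(baseline, trace, threshold=3.0):
    """Report events never seen in training, or far more frequent than usual."""
    observed = Counter(trace)
    anomalies = []
    for event, count in observed.items():
        expected = baseline.get(event, 0.0)
        if expected == 0.0 or count > threshold * expected:
            anomalies.append(event)
    return sorted(anomalies)

normal = [["boot", "api_call", "api_call", "commit"],
          ["boot", "api_call", "commit"]]
faulty = ["boot", "api_call", "retry", "retry", "retry", "timeout"]

baseline = train_baseline(normal)
print(detect_anomalies(baseline, faulty))  # ['retry', 'timeout']
```

The threshold and the simple count-based model are placeholders for whatever detector the paper actually trains; the point is only the workflow of baselining normal runs and diffing injected ones.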

    Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

    Full text link
    Cloud computing systems fail in complex and unexpected ways, due to unforeseen combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost. Comment: IEEE Transactions on Dependable and Secure Computing; 16 pages. arXiv admin note: text overlap with arXiv:1908.1164
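To make the "unsupervised learning over execution traces" idea concrete, here is a hedged sketch: each experiment's trace becomes an event-count vector, and similar vectors are grouped so each group approximates one failure mode. A simple greedy (leader) clustering stands in for whatever unsupervised method the paper actually uses; traces and the radius are illustrative.

```python
# Sketch: cluster fault-injection traces by event-count similarity.
from collections import Counter
import math

def vectorize(trace, vocabulary):
    """Bag-of-events representation of one execution trace."""
    counts = Counter(trace)
    return [counts.get(event, 0) for event in vocabulary]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leader_cluster(vectors, radius):
    """Assign each vector to the first cluster leader within `radius`."""
    leaders, labels = [], []
    for v in vectors:
        for i, leader in enumerate(leaders):
            if distance(v, leader) <= radius:
                labels.append(i)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels

traces = [
    ["api_error", "api_error", "timeout"],   # failure mode A
    ["api_error", "timeout"],                # close to A
    ["instance_stuck", "instance_stuck"],    # failure mode B
]
vocab = sorted({e for t in traces for e in t})
vectors = [vectorize(t, vocab) for t in traces]
print(leader_cluster(vectors, radius=1.5))  # [0, 0, 1]
```

Each resulting cluster would then be inspected once by the analyst instead of reading every trace, which is the labor saving the paradigm targets.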

    Repeated Builds During Code Review: An Empirical Study of the OpenStack Community

    Full text link
    Code review is a popular practice where developers critique each other's changes. Since automated builds can identify low-level issues (e.g., syntactic errors, regression bugs), it is not uncommon for software organizations to incorporate automated builds in the code review process. In such code review deployment scenarios, submitted change sets must be approved for integration by both peer code reviewers and automated build bots. Since automated builds may produce an unreliable signal of the status of a change set (e.g., due to "flaky" or non-deterministic execution behaviour), code review tools, such as Gerrit, allow developers to request a "recheck", which repeats the build process without updating the change set. We conjecture that an unconstrained recheck command will waste time and resources if it is not applied judiciously. To explore how the recheck command is applied in a practical setting, we conduct an empirical study of 66,932 code reviews from the OpenStack community. We quantitatively analyze (i) how often build failures are rechecked; (ii) the extent to which invoking recheck changes build failure outcomes; and (iii) how much waste is generated by invoking recheck. We observe that (i) 55% of code reviews invoke the recheck command after a failing build is reported; (ii) invoking the recheck command changes the outcome of a failing build in only 42% of the cases; and (iii) invoking the recheck command increases review waiting time by an average of 2,200% and equates to 187.4 compute years of waste, enough compute time to rival the age of the oldest living land animal on Earth.
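The study's first two quantities reduce to straightforward proportions over per-review build records. A hypothetical sketch of that computation follows; the records are made up and the field names are invented, not the study's actual data schema.

```python
# Sketch: compute recheck rate and outcome-flip rate from review records.
reviews = [
    {"first_build": "fail", "rechecked": True,  "final_build": "pass"},
    {"first_build": "fail", "rechecked": True,  "final_build": "fail"},
    {"first_build": "fail", "rechecked": False, "final_build": "fail"},
    {"first_build": "pass", "rechecked": False, "final_build": "pass"},
]

failed = [r for r in reviews if r["first_build"] == "fail"]
rechecked = [r for r in failed if r["rechecked"]]
flipped = [r for r in rechecked if r["final_build"] == "pass"]

recheck_rate = len(rechecked) / len(failed)   # share of failures rechecked
flip_rate = len(flipped) / len(rechecked)     # rechecks that change the outcome
print(round(recheck_rate, 3), flip_rate)
```

Scaled to 66,932 reviews, the same ratios yield the study's 55% and 42% figures; the waste estimate additionally multiplies repeated builds by their compute time.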

    Hybrid Clouds for Data-Intensive, 5G-Enabled IoT Applications: An Overview, Key Issues and Relevant Architecture

    Get PDF
    Hybrid cloud multi-access edge computing (MEC) deployments have been proposed as an efficient means to support Internet of Things (IoT) applications, relying on a plethora of nodes and data. In this paper, an overview of the area of hybrid clouds is given, considering relevant research areas and providing technologies and mechanisms for the formation of such MEC deployments, as well as emphasizing several key issues that should be tackled by novel approaches, especially under the 5G paradigm. Furthermore, a decentralized hybrid cloud MEC architecture, realized as a Platform-as-a-Service (PaaS), is proposed, and its main building blocks and layers are thoroughly described. Aiming to offer a broad perspective on the business potential of such a platform, the stakeholder ecosystem is also analyzed. Finally, two use cases in the context of smart cities and mobile health are presented, aimed at showing how the proposed PaaS enables the development of the respective IoT applications. Peer reviewed. Postprint (published version).
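One recurring decision in such hybrid edge/cloud deployments is where to place a workload. The toy sketch below, with invented node names and thresholds (not the paper's architecture), illustrates the basic trade-off: latency-sensitive IoT workloads go to a nearby edge node, while resource-hungry ones fall back to the central cloud.

```python
# Sketch: naive workload placement across edge and cloud tiers.
nodes = [
    {"name": "edge-1",  "tier": "edge",  "latency_ms": 5,  "free_cpus": 2},
    {"name": "edge-2",  "tier": "edge",  "latency_ms": 8,  "free_cpus": 0},
    {"name": "cloud-1", "tier": "cloud", "latency_ms": 40, "free_cpus": 64},
]

def place(workload, nodes):
    """Pick the lowest-latency node that satisfies the workload's constraints."""
    candidates = [n for n in nodes
                  if n["free_cpus"] >= workload["cpus"]
                  and n["latency_ms"] <= workload["max_latency_ms"]]
    if not candidates:
        return None  # no feasible node: reject or queue the workload
    return min(candidates, key=lambda n: n["latency_ms"])["name"]

print(place({"cpus": 1, "max_latency_ms": 10}, nodes))    # edge-1
print(place({"cpus": 16, "max_latency_ms": 100}, nodes))  # cloud-1
```

A real MEC orchestrator would weigh many more factors (mobility, energy, cost, data locality), but the feasibility-then-optimize shape is the same.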

    Data Labeling for Fault Detection in Cloud: A Test Suite-Based Active Learning Approach

    Get PDF
    Cloud computing enables ubiquitous on-demand network access to a shared pool of configurable computing resources with minimal management effort from the user. It has evolved as a key computing paradigm to enable a wide variety of applications such as e-commerce, social networks, high-performance computing, mission-critical applications, and the Internet of Things (IoT). Ensuring the quality of service of applications deployed in inherently complex and fault-prone cloud environments is of utmost concern to service providers and end users. Machine learning-based fault management solutions enable proactive identification and mitigation of faults in cloud environments to attain the desired reliability, though they require labeled cloud metrics data for training and evaluation. Moreover, the high dynamicity in cloud environments brings forth emerging data distributions, which necessitate frequent labeling of cloud metrics data stemming from an evolving data distribution for model adaptation. In this thesis, we study the problem of data labeling for fault detection in cloud environments, paying close attention to the phenomenon of evolving cloud metric data distributions. More specifically, we propose a test suite-based active learning framework for automated labeling of cloud metrics data with the corresponding cloud system state, while accounting for emerging fault patterns and data or concept drifts. We implemented our solution on a cloud testbed and introduced various emerging data distribution scenarios to evaluate the proposed framework's labeling efficacy over known and emerging data distributions. According to our evaluation results, the proposed framework achieves about 41% higher weighted F1-score and 34% higher average area under the one-vs-rest receiver operating characteristic curve (OvR ROC AUC) than a system without any adaptation for emerging data distributions.
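The core loop of test suite-based active learning can be sketched as follows. Everything below is illustrative, not the thesis's actual implementation: a stand-in model labels each window of cloud metrics, and only when its confidence is low is a (here, simulated) test suite run to obtain a ground-truth label, keeping labeling cost low under drift.

```python
# Sketch: label a metrics stream, querying the test suite only when unsure.
def run_test_suite(window):
    """Stand-in oracle: pretend the suite fails when CPU saturates."""
    return "faulty" if window["cpu"] > 0.9 else "healthy"

def model_predict(window):
    """Stand-in model: confident far from a 0.5 CPU decision boundary."""
    score = window["cpu"]
    label = "faulty" if score > 0.5 else "healthy"
    confidence = abs(score - 0.5) * 2
    return label, confidence

def label_stream(windows, min_confidence=0.6):
    labels, suite_runs = [], 0
    for w in windows:
        label, conf = model_predict(w)
        if conf < min_confidence:      # uncertain: ask the test suite instead
            label = run_test_suite(w)
            suite_runs += 1
        labels.append(label)
    return labels, suite_runs

stream = [{"cpu": 0.1}, {"cpu": 0.55}, {"cpu": 0.95}]
print(label_stream(stream))  # (['healthy', 'healthy', 'faulty'], 1)
```

In the thesis's setting, the queried labels would also be fed back to retrain the model, which is how the framework adapts to emerging distributions.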

    Next generation control of transport networks

    Get PDF
    It is widely understood by telecom operators and industry analysts that bandwidth demand is increasing dramatically, year on year, with typical growth figures of 50% for Internet-based traffic [5]. This trend means that consumers will have both a wide variety of devices attaching to their networks and a range of high-bandwidth service requirements. The corresponding impact is the effect on the traffic engineered network (often referred to as the “transport network”) of ensuring that the current rate of growth of network traffic is supported and predicted future demands are met. As traffic demands increase and newer services continuously arise, novel network elements are needed to provide more flexibility, scalability, resilience, and adaptability in today’s transport network. The transport network provides transparent, traffic engineered communication of user, application, and device traffic between attached clients (software and hardware), establishing and maintaining point-to-point or point-to-multipoint connections. The research documented in this thesis was based on three initial research questions, posed while performing research at British Telecom’s research labs and investigating control of future transport networks: 1. How can we meet Internet bandwidth growth yet minimise network costs? 2. Which enabling network technologies might be leveraged to control network layers and functions cooperatively, instead of with separate per-layer and per-technology control? 3. Is it possible to utilise both centralised and distributed control mechanisms for automation and traffic optimisation? This thesis aims to provide the classification, motivation, invention, and evolution of a next generation control framework for transport networks, with special consideration of delivering broadcast video traffic to UK subscribers.
The document outlines pertinent telecoms technology and the current state of the art, how requirements were gathered and research conducted, how the functional components of the transport control framework were identified and selected, and how the architecture was implemented and applied to key research projects requiring next generation control capabilities, both at British Telecom and in the wider research community. Finally, in the closing chapters, the thesis outlines the next steps for ongoing research and development of the transport network framework and key areas for further study.

    Understanding and fulfilling information needs of stakeholders along product lifecycles: Applying data analytics in product life-cycle management

    Get PDF
    Improving the management of the product lifecycle is becoming increasingly important today. The concept of Closed Loop Product Lifecycle Management (CL-PLM) has been developed to help organizations in the value chain promote the comprehensive management of product lifecycle information and activities. CL-PLM attempts to manage all processes related to products, from product design to the end of product life. However, the development of this concept is still not complete. There is a need for more research to manage the data and information flows and to provide each organization in the value chain with the right data from the product's Middle Of Life (MOL). In this thesis, a new approach is presented, which completes the concept of CL-PLM in terms of the information requirements of product beneficiaries in the MOL phase of the product. The research question is as follows: Which groups and organizations use the product or are affected by it, and how can they benefit from the product's operational information? The research involves the determination of the information needs of the stakeholders and tests the automatic provision of information to stakeholder organizations (beneficiaries) using data analytics methods. The methodology used in this thesis is an inductive approach. To this end, two case studies on engineered products are carried out. Based on these two studies, the stakeholders involved in MOL who benefit from the product and its information are identified. Next, the dissertation identifies current and future information needs; interviews and surveys have been used to identify the information that MOL stakeholders need. The second part of this thesis examines which data analytics methods can be useful for supporting each information need. After the completion of the analysis, the findings are tested and implemented in real-world scenarios. The scenarios show how these information requirements can be satisfied with data analytics. The findings of this research are threefold.
The first is the identification of the MOL stakeholders of the product lifecycle and the grouping of the most influential ones, i.e., those affected by product MOL data and information, together with their current and potential future information needs from the product MOL. The second is a classification of data analytics tools (suggestions on suitable tools) that can help the stakeholders meet their new information needs from the product, or improve their access to product MOL information. The third is the implementation of the above concept in three scenarios: the information needs of an OEM about the performance of a component in an electric vehicle, of a maintenance provider for spare part supply for a motorboat, and of a wind farm operator regarding contract design for windmills are selected and modeled. The latter shows that it is possible to automate the provision of the required information to the stakeholders. These three contributions form the output of this dissertation. In sum, the findings of this thesis can be helpful for product and service manufacturers, especially those who want to move from pure production to the production of products with accompanying services, and for organizations who wish to add intelligence to their products. This research can also be useful for PLM software developers. Finally, it is helpful for all companies involved in the realization of engineered products that want to understand and manage the lifecycle of the product.

    Efficient Failure Diagnosis of OpenStack Using Tempest

    No full text