14,099 research outputs found

    An approach to failure prediction in a cloud based environment

    Get PDF
    yesFailure in a cloud system is defined as an even that occurs when the delivered service deviates from the correct intended behavior. As the cloud computing systems continue to grow in scale and complexity, there is an urgent need for cloud service providers (CSP) to guarantee a reliable on-demand resource to their customers in the presence of faults thereby fulfilling their service level agreement (SLA). Component failures in cloud systems are very familiar phenomena. However, large cloud service providers’ data centers should be designed to provide a certain level of availability to the business system. Infrastructure-as-a-service (Iaas) cloud delivery model presents computational resources (CPU and memory), storage resources and networking capacity that ensures high availability in the presence of such failures. The data in-production-faults recorded within a 2 years period has been studied and analyzed from the National Energy Research Scientific computing center (NERSC). Using the real-time data collected from the Computer Failure Data Repository (CFDR), this paper presents the performance of two machine learning (ML) algorithms, Linear Regression (LR) Model and Support Vector Machine (SVM) with a Linear Gaussian kernel for predicting hardware failures in a real-time cloud environment to improve system availability. The performance of the two algorithms have been rigorously evaluated using K-folds cross-validation technique. Furthermore, steps and procedure for future studies has been presented. This research will aid computer hardware companies and cloud service providers (CSP) in designing a reliable fault-tolerant system by providing a better device selection, thereby improving system availability and minimizing unscheduled system downtime

    Failure prediction using machine learning in a virtualised HPC system and application

    Get PDF
    Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remains a challenging research problem. Traditional existing fault-tolerance strategies such as regular check-pointing and replication are not adequate because of the emerging complexities of high performance computing systems. This necessitates the importance of having an effective as well as proactive failure management approach in place aimed at minimizing the effect of failure within the system. With the advent of machine learning techniques, the ability to learn from past information to predict future pattern of behaviours makes it possible to predict potential system failure more accurately. Thus, in this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicates that the average prediction accuracy of our model using SVM when predicting failure is 90% accurate and effective compared to other algorithms. This finding implies that our method can effectively predict all possible future system and application failures within the system

    Monitoring resilience in a rook-managed containerized cloud storage system

    Get PDF
    Distributed cloud storage solutions are currently gaining high momentum in industry and academia. The enterprise data volume growth and the recent tendency to move as much as possible data to the cloud is strongly stimulating the storage market growth. In this context, and as a main requirement for cloud native applications, it is of utmost importance to guarantee resilience of the deployed applications and the infrastructure. Indeed, with failures frequently occurring, a storage system should quickly recover to guarantee service availability. In this paper, we focus on containerized cloud storage, proposing a resilience monitoring solution for the recently developed Rook storage operator. While, Rook brings storage systems into a cloud-native container platform, in this paper we design an additional module to monitor and evaluate the resilience of the Rook-based system. Our proposed module is validated in a production environment, with software components generating a constant load and a controlled removal of system elements to evaluate the self-healing capability of the storage system. Failure recovery time revealed to be 41 and 142 seconds on average for a 32GB and a 215GB object storage device respectively

    Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)

    Get PDF
    Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging
    corecore