
    Proactive Fault Tolerance Through Cloud Failure Prediction Using Machine Learning

    Fault tolerance is one of the crucial aspects of cloud infrastructure; its primary responsibility is to handle the situations that arise when different architectural parts fail. A sizeable cloud data center must deliver high service dependability and availability while minimizing failure incidence. However, modern large cloud data centers continue to have significant failure rates owing to a variety of factors, including hardware and software faults, which often lead to task and job failures. To reduce unexpected loss, it is critical to forecast task or job failures with high accuracy before they occur. This research examines the performance of four machine learning (ML) algorithms for forecasting failure in a real-time cloud environment, with the aim of increasing system availability, using real-time data gathered from the Google Cluster Workload Traces 2019. We applied four distinct supervised machine learning classifiers: logistic regression, KNN, SVM, and decision tree. Confusion matrices and ROC curves were used to assess the reliability and robustness of each algorithm. This study will assist cloud service providers in developing a robust fault tolerance design by optimizing device selection, consequently boosting system availability and eliminating unexpected system downtime.

    Data Replication and Its Alignment with Fault Management in the Cloud Environment

    Exponential data growth has become one of the major challenges worldwide. It can cause a series of negative impacts such as network overloading, high system complexity, and inadequate data security. Cloud computing was developed as a novel paradigm to alleviate massive data processing challenges through its on-demand services and distributed architecture. Data replication has been proposed to strategically distribute the data access load across multiple cloud data centres by creating multiple data copies at those data centres. A cloud environment that applies replication not only achieves a decrease in response time, an increase in data availability, and a more balanced resource load, but also protects the cloud environment against upcoming faults. A reactive fault tolerance strategy is also required to handle faults after they have occurred. As a result, data replication strategies should be aligned with reactive fault tolerance strategies to achieve a complete management chain in the cloud environment. In this thesis, a data replication and fault management framework is proposed to establish decentralised, overarching management of the cloud environment. Three data replication strategies are first proposed based on this framework. A replica creation strategy is proposed to reduce the total cost by jointly considering data dependency and access frequency in the replica creation decision-making process. In addition, a cloud-map-oriented, cost-efficiency-driven replica creation strategy is proposed to achieve the optimal cost reduction per replica in the cloud environment. The local and remote data relationships are further analysed by creating two novel data dependency types, Within-DataCentre Data Dependency and Between-DataCentre Data Dependency, according to the data location. Furthermore, a network-performance-based replica selection strategy is proposed to avoid potential network overloading problems and, at the same time, to increase the number of concurrently running instances.