4,086 research outputs found

    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    High Performance Computing (HPC) systems are widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most such problems are either data-intensive or computation-intensive, and they may take hours, days or even weeks to execute. For example, some computations on traditional HPC systems run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud, and the cloud computing pricing model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems.

    Reliable and energy efficient resource provisioning in cloud computing systems

    Cloud computing has revolutionized the information technology sector by giving computing the perspective of a service. Users can access cloud services through easy-to-use portals without knowing about the underlying system. To provide such an abstract view, cloud computing systems have to perform many complex operations besides managing a large underlying infrastructure. These operations confront service providers with many challenges, such as security, sustainability, reliability, energy consumption and resource management. Among these, reliability and energy consumption are the two key challenges addressed in this thesis because of their conflicting nature. Current solutions focus either on reliability techniques or on energy efficiency methods. However, mechanisms that provide reliability in cloud computing systems can worsen energy consumption: adding backup resources and running replicated systems provide strong fault tolerance but also increase energy consumption. Conversely, reducing energy consumption by running resources at low power scaling levels, or by reducing the number of active but idle resources such as backup resources, reduces system reliability. This creates a critical trade-off between the two metrics, which is investigated in this thesis. To address this problem, this thesis presents novel resource management policies that provision the best resources in terms of reliability and energy efficiency and allocate them to suitable virtual machines. A mathematical framework showing the interplay between reliability and energy consumption is also proposed. A formal method to calculate the finishing time of tasks running in a cloud computing environment affected by independent and correlated failures is also provided. The proposed policies adopt various fault tolerance mechanisms while satisfying constraints such as task deadlines and utility values. This thesis also provides a novel failure-aware VM consolidation method, which takes the failure characteristics of resources into consideration before performing VM consolidation. All the proposed resource management methods are evaluated using real failure traces collected from various distributed computing sites. To perform the evaluation, a cloud computing framework, 'ReliableCloudSim', capable of simulating failure-prone cloud computing systems, is developed. The key research findings and contributions of this thesis are: 1. If emphasis is given only to energy optimization, without considering reliability, in a failure-prone cloud computing environment, the results can be contrary to intuitive expectations. Rather than reducing energy consumption, the system ends up consuming more energy because of the energy losses incurred through failure overheads. 2. When performing VM consolidation in a failure-prone cloud computing environment, a significant improvement in energy efficiency and reliability can be achieved by considering the failure characteristics of physical resources. 3. By considering the correlated occurrence of failures during resource provisioning and VM allocation, service downtime or interruption is reduced by 34% in comparison with environments that assume independent occurrence of failures. Moreover, as measured by our mathematical model, the ratio of reliability to energy consumption is improved by 14%.

    Efficient data reliability management of cloud storage systems for big data applications

    Cloud service providers consistently strive to provide efficient and reliable service for their clients' Big Data storage needs. Replication is a simple and flexible method to ensure reliability and availability of data. However, it is not an efficient solution for Big Data, which typically scales to terabytes and petabytes. Hence, erasure coding is gaining traction despite its shortcomings. Deploying erasure coding in cloud storage confronts several challenges, such as encoding/decoding complexity, load balancing, exponential resource consumption due to data repair, and read latency. This thesis addresses many of these challenges. Even though data durability and availability should not be compromised for any reason, clients' requirements on read performance (access latency) may vary with the nature of the data and its access pattern. Access latency is an important metric, and the acceptable latency range can be recorded in the client's SLA. This research proposes several proactive recovery methods for erasure codes to reduce the resource consumption caused by recovery. It also proposes a novel cache-based solution to mitigate the access latency issue of erasure coding.

    Edge Computing for Extreme Reliability and Scalability

    The massive number of Internet of Things (IoT) devices and their continuous data collection will lead to a rapid increase in the scale of collected data. Processing all of this data at the central cloud server is inefficient, and may even be infeasible or unnecessary. Hence, the task of processing the data is pushed to the network edges, introducing the concept of Edge Computing. Processing the information closer to the source of data (e.g., on gateways and on edge micro-servers) not only reduces the huge workload of the central cloud, but also decreases the latency for real-time applications by avoiding the unreliable and unpredictable network latency of communicating with the central cloud.

    Opportunities of IoT in Fog Computing for High Fault Tolerance and Sustainable Energy Optimization

    Today, the importance of enhanced quality of service and energy optimization has promoted research into sensor applications such as pervasive health monitoring and distributed computing. In general, the resulting sensor data are stored on a cloud server for future processing. For this purpose, the use of fog computing from a real-world perspective has recently emerged, utilizing end-user nodes and neighboring edge devices to perform computation and communication. This paper aims to develop a quality-of-service-based energy optimization (QoS-EO) scheme for wireless sensor environments deployed in fog computing. The fog nodes deployed in specific geographical areas cover the sensor activity performed in those areas. The logical situation of the entire system is informed by the fog nodes, as portrayed. The implemented techniques enable services in a fog-collaborated WSN environment. Thus, the proposed scheme performs quality-of-service placement and optimizes the network energy. The results show a maximum turnaround time of 8 ms, a minimum turnaround time of 1 ms, and an average turnaround time of 3 ms. The calculated costs indicate that as the number of iterations increases, the path cost value decreases, demonstrating the efficacy of the proposed technique. The CPU execution delay was reduced to a minimum of 0.06 s. In comparison, the proposed QoS-EO scheme has a lower network usage of 611,643.3 and a lower execution cost of 83,142.2. Thus, the results show the best cost estimation, reliability, and performance of data transfer in a short time, showing a high level of network availability, throughput, and performance guarantee.

    Data Replication and Its Alignment with Fault Management in the Cloud Environment

    Exponential data growth has become one of the major challenges worldwide. It may cause a series of negative impacts such as network overloading, high system complexity, and inadequate data security. Cloud computing was developed as a novel paradigm to alleviate massive data processing challenges through its on-demand services and distributed architecture. Data replication has been proposed to strategically distribute the data access load across multiple cloud data centres by creating multiple data copies at those data centres. A replica-applied cloud environment not only achieves a decrease in response time, an increase in data availability, and a more balanced resource load, but also protects the cloud environment against upcoming faults. A reactive fault tolerance strategy is also required to handle faults that have already occurred. As a result, data replication strategies should be aligned with reactive fault tolerance strategies to achieve a complete management chain in the cloud environment. In this thesis, a data replication and fault management framework is proposed to establish decentralised overarching management of the cloud environment. Three data replication strategies are first proposed based on this framework. A replica creation strategy is proposed to reduce the total cost by jointly considering data dependency and access frequency in the replica creation decision-making process. In addition, a cloud-map-oriented and cost-efficiency-driven replica creation strategy is proposed to achieve the optimal cost reduction per replica in the cloud environment. The local and remote data relationships are further analysed by creating two novel data dependency types, Within-DataCentre Data Dependency and Between-DataCentre Data Dependency, according to the data location. Furthermore, a network-performance-based replica selection strategy is proposed to avoid potential network overloading problems and, at the same time, to increase the number of concurrently running instances.