6,579 research outputs found
A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud
High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems
Technical Report on Deploying a highly secured OpenStack Cloud Infrastructure using BradStack as a Case Study
Cloud computing has emerged as a popular paradigm and an attractive model for
providing a reliable distributed computing model.it is increasing attracting
huge attention both in academic research and industrial initiatives. Cloud
deployments are paramount for institution and organizations of all scales. The
availability of a flexible, free open source cloud platform designed with no
propriety software and the ability of its integration with legacy systems and
third-party applications are fundamental. Open stack is a free and opensource
software released under the terms of Apache license with a fragmented and
distributed architecture making it highly flexible. This project was initiated
and aimed at designing a secured cloud infrastructure called BradStack, which
is built on OpenStack in the Computing Laboratory at the University of
Bradford. In this report, we present and discuss the steps required in
deploying a secured BradStack Multi-node cloud infrastructure and conducting
Penetration testing on OpenStack Services to validate the effectiveness of the
security controls on the BradStack platform. This report serves as a practical
guideline, focusing on security and practical infrastructure related issues. It
also serves as a reference for institutions looking at the possibilities of
implementing a secured cloud solution.Comment: 38 pages, 19 figures
Proactive Scheduling in Cloud Computing
Autonomic fault aware scheduling is a feature quite important for cloud computing and it is related to adoption of workload variation. In this context, this paper proposes an fault aware pattern matching autonomic scheduling for cloud computing based on autonomic computing concepts. In order to validate the proposed solution, we performed two experiments one with traditional approach and other other with pattern recognition fault aware approach. The results show the effectiveness of the scheme
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201
- âŠ