
    Checkpointing as a Service in Heterogeneous Cloud Environments

    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another. Comment: 20 pages, 11 figures, appears in CCGrid, 201
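    A minimal sketch of the health-monitoring behaviour described in the abstract: periodically poll a long-running job, restart it from a checkpoint on failure, and checkpoint-and-suspend it when performance drops. The `Job` class, method names, and thresholds are illustrative assumptions, not the service's actual API.

    ```python
    import time
    from dataclasses import dataclass

    # Hypothetical job handle; the abstract does not name the service's real API.
    @dataclass
    class Job:
        name: str
        alive: bool = True
        throughput: float = 100.0   # work units/second, as reported by the application

        def restart_from_checkpoint(self):
            print(f"{self.name}: failed -> restarting from last checkpoint")

        def checkpoint_and_suspend(self):
            print(f"{self.name}: exceptionally slow -> checkpointing and suspending")

    MIN_THROUGHPUT = 10.0   # assumed threshold for "exceptionally low performance"
    POLL_INTERVAL = 30      # seconds between health checks

    def health_check(job: Job) -> None:
        """One monitoring pass: react to failure or to resource starvation."""
        if not job.alive:
            job.restart_from_checkpoint()
        elif job.throughput < MIN_THROUGHPUT:
            job.checkpoint_and_suspend()

    def monitor(job: Job, checks: int = 3) -> None:
        """Poll the job a fixed number of times (a real monitor would loop until completion)."""
        for _ in range(checks):
            health_check(job)
            time.sleep(POLL_INTERVAL)

    health_check(Job("long_running_solver", throughput=4.0))
    ```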

    Adaptive scheduling solution for grid meta-brokering

    The nearly optimal, interoperable utilization of various grid resources plays an important role in the world of grids. Although well-designed, evaluated and widely used resource brokers have been developed, these existing solutions still cannot cope with the high uncertainty that rules current grid systems. To ease the simultaneous utilization of different middleware systems, researchers need to revise current solutions. In this paper we propose advanced scheduling techniques with a weighted fitness function for an adaptive Meta-Brokering Grid Service, which enables higher-level utilization of the existing grid brokers. We also set up a grid simulation environment to demonstrate the efficiency of the proposed meta-level scheduling solution. The presented evaluation results show that the proposed novel scheduling technique delivers better performance in the meta-brokering context.
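    To make the weighted-fitness idea concrete, the sketch below ranks brokers by a weighted sum of per-broker properties and dispatches to the highest-scoring one. The metric names and weights are assumptions for illustration; the paper's actual fitness function and properties may differ.

    ```python
    # Illustrative weighted-fitness broker selection (not the paper's exact formula).
    WEIGHTS = {"success_rate": 0.5, "avg_load": -0.3, "queue_length": -0.2}

    def fitness(broker_stats: dict) -> float:
        """Weighted sum of per-broker properties (higher is better)."""
        return sum(WEIGHTS[k] * broker_stats[k] for k in WEIGHTS)

    def select_broker(brokers: dict) -> str:
        """Pick the broker with the highest fitness score for the next job."""
        return max(brokers, key=lambda name: fitness(brokers[name]))

    brokers = {
        "broker_A": {"success_rate": 0.92, "avg_load": 0.60, "queue_length": 0.10},
        "broker_B": {"success_rate": 0.85, "avg_load": 0.30, "queue_length": 0.05},
    }
    print(select_broker(brokers))
    ```

    An adaptive meta-broker would additionally update the weights or the per-broker statistics at run time as broker behaviour changes.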

    Developing a distributed electronic health-record store for India

    The DIGHT project is addressing the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of the over one billion citizens of India.

    Fault-Tolerant Dynamic Deduplication for Utility Computing

    Utility computing is an increasingly important paradigm, whereby computing resources are provided on-demand as utilities. An important component of utility computing is storage: data volumes are growing rapidly, and mechanisms to mitigate this growth need to be developed. Data deduplication is a promising technique for drastically reducing the amount of data stored in such systems; however, current approaches are static in nature, using an amount of redundancy fixed at design time, which is inappropriate for truly dynamic modern systems. We propose a real-time adaptive deduplication system for Cloud and Utility computing that monitors changing system, user, and environmental behaviour in real time in order to balance changing storage efficiency, performance, and fault-tolerance requirements. We evaluate our system through simulation, with experimental results showing that it is both efficient and scalable. We also evaluate the fault tolerance of the system experimentally by measuring Mean Time To Repair (MTTR) and using these values to calculate system availability. The results show that higher replication levels result in higher system availability, but that the number of files in the system also affects recovery time. The trade-off between replication level and recovery time when the system is overloaded needs further investigation.
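    Deriving availability from MTTR measurements is typically done with the standard steady-state relation Availability = MTBF / (MTBF + MTTR). A minimal sketch, using made-up MTBF/MTTR figures rather than the paper's measured values:

    ```python
    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Steady-state availability = MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Example: higher replication shortens repair time, raising availability.
    print(availability(mtbf_hours=1000, mttr_hours=10))   # ~0.990
    print(availability(mtbf_hours=1000, mttr_hours=2))    # ~0.998
    ```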

    Grid-enabling FIRST: Speeding up simulation applications using WinGrid

    The vision of grid computing is to make computational power, storage capacity, data and applications available to users as readily as electricity and other utilities. Grid infrastructures and applications have traditionally been geared towards dedicated, centralized, high-performance clusters running on UNIX-flavour operating systems (commonly referred to as cluster-based grid computing). This can be contrasted with desktop-based grid computing, which refers to the aggregation of non-dedicated, decentralized, commodity PCs connected through a network and running (mostly) the Microsoft Windows operating system. Large-scale adoption of such Windows-based grid infrastructure may be facilitated by grid-enabling existing Windows applications. This paper presents the WinGrid approach to grid-enabling existing Windows-based commercial-off-the-shelf (COTS) simulation packages (CSPs). Through a case study developed in conjunction with Ford Motor Company, the paper demonstrates how experimentation with the CSP Witness and FIRST can achieve a linear speedup when WinGrid is used to harness idle PC computing resources. This, combined with the lessons learned from the case study, has encouraged us to develop Web service extensions to WinGrid. It is hoped that this will facilitate wider acceptance of WinGrid among enterprises having stringent security policies in place.
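    The near-linear speedup comes from the fact that the simulation experiments are independent and can be farmed out to idle machines. A local sketch of the same master/worker pattern, using a process pool as a stand-in for WinGrid's desktop-grid workers (the real system dispatches runs to remote Windows PCs, not local processes):

    ```python
    from multiprocessing import Pool
    import time

    def run_experiment(params: int) -> float:
        """Stand-in for one CSP simulation run (e.g. a single Witness experiment)."""
        time.sleep(0.1)          # pretend to simulate
        return params * 2.0

    if __name__ == "__main__":
        experiments = list(range(16))
        with Pool(processes=4) as pool:      # four "idle PCs"
            results = pool.map(run_experiment, experiments)
        print(results)
    ```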

    Towards low cost prototyping of mobile opportunistic disconnection tolerant networks and systems

    Fast-emerging mobile edge computing, mobile clouds, the Internet of Things (IoT) and cyber-physical systems require many novel, realistic, real-time multi-layer algorithms for a wide range of domains, such as intelligent content provision and processing, smart transport, smart manufacturing systems and mobile end-user applications. This paper proposes a low-cost open-source platform, MODiToNeS, which uses commodity hardware to support prototyping and testing of fully distributed multi-layer complex algorithms over real-world (or pseudo-real) traces. The MODiToNeS platform is generic and comprises multiple interfaces that allow real-time topology and mobility control, deployment and analysis of different self-organised and self-adaptive routing algorithms, real-time content processing, and real-time environment sensing with predictive analytics. Our platform also allows rich interactivity with the user. We show deployment and analysis of two vastly different complex networking systems: a fault- and disconnection-aware smart manufacturing sensor network and cognitive privacy for personal clouds. We show that our platform design can integrate both contexts transparently and organically and allows a wide range of analyses.
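    One way to picture the "pluggable routing algorithm driven by traces" part of such a platform is an interface that receives contact events replayed from a mobility trace. The class and method names below are assumptions for illustration, not MODiToNeS's actual API.

    ```python
    from abc import ABC, abstractmethod

    class RoutingAlgorithm(ABC):
        @abstractmethod
        def on_contact(self, node_a: str, node_b: str, time_s: float) -> None:
            """Called whenever the mobility layer reports two nodes in contact."""

    class EpidemicRouting(RoutingAlgorithm):
        """Simplest disconnection-tolerant scheme: exchange everything on contact."""
        def __init__(self):
            self.buffers = {}   # node -> set of message ids

        def on_contact(self, node_a, node_b, time_s):
            a = self.buffers.setdefault(node_a, set())
            b = self.buffers.setdefault(node_b, set())
            a |= b
            b |= a

    def replay(trace, algorithm: RoutingAlgorithm):
        """Drive a routing algorithm from a (time, node_a, node_b) contact trace."""
        for time_s, a, b in sorted(trace):
            algorithm.on_contact(a, b, time_s)

    algo = EpidemicRouting()
    algo.buffers = {"n1": {"msg_1"}, "n2": set(), "n3": set()}
    replay([(0.0, "n1", "n2"), (5.0, "n2", "n3")], algo)
    print(algo.buffers["n3"])   # msg_1 reached n3 via the n2 contact
    ```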

    Enhancing reliability with Latin Square redundancy on desktop grids.

    Computational grids are some of the largest computer systems in existence today. Unfortunately they are also, in many cases, the least reliable. This research examines the use of redundancy with permutation as a method of improving reliability in computational grid applications. Three primary avenues are explored: development of a new redundancy model, the Replication and Permutation Paradigm (RPP), for computational grids; development of grid simulation software for testing RPP against other redundancy methods; and, finally, running a program on a live grid using RPP. An important part of RPP involves distributing data and tasks across the grid in Latin Square fashion. Two theorems and their proofs regarding Latin Squares are developed. The theorems describe the changing position of symbols between the rows of a standard Latin Square: when a symbol is missing because a column has been removed, they provide a basis for determining the next row and column where the missing symbol can be found. Interesting in their own right, the theorems also have implications for redundancy; in terms of the redundancy model, they allow one to state the maximum makespan in the face of missing computational hosts when using Latin Square redundancy. The simulator software was developed and used to compare different data and task distribution schemes on a simulated grid. The software clearly showed the advantage of running RPP, which resulted in faster completion times in the face of computational host failures. The Latin Square method also fails gracefully: jobs still complete under massive node failure, at the cost of increased makespan. Finally, an Inductive Logic Program (ILP) for pharmacophore search was executed, using the Latin Square redundancy methodology, on a Condor grid in the Dahlem Lab at the University of Louisville Speed School of Engineering. All jobs completed, even in the face of large numbers of randomly generated computational host failures.
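    A minimal sketch of the Latin Square placement idea, using a cyclic square for concreteness: rows act as replica sets, columns as hosts, and symbols as data/task blocks, so that removing a column (a failed host) still leaves each symbol present once in every other row. This row/column/symbol mapping is an illustrative assumption; the dissertation's theorems cover standard Latin Squares more generally.

    ```python
    # Cyclic Latin Square: entry (r, c) holds symbol (r + c) mod n.
    def cyclic_latin_square(n: int):
        return [[(r + c) % n for c in range(n)] for r in range(n)]

    def find_replicas(square, symbol: int, failed_col: int):
        """Locate every surviving (row, column) still holding `symbol`."""
        return [(r, c)
                for r, row in enumerate(square)
                for c, s in enumerate(row)
                if s == symbol and c != failed_col]

    n = 5
    square = cyclic_latin_square(n)
    failed_col = 2                      # host 2 goes down
    missing = square[0][failed_col]     # symbol lost from row 0
    print(find_replicas(square, missing, failed_col))
    # The symbol still appears exactly once in each of the other n-1 rows.
    ```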