
    Programming your way out of the past: ISIS and the META Project

    The ISIS distributed programming system and the META Project are described. The ISIS programming toolkit is an aid to low-level programming that makes it easy to build fault-tolerant distributed applications that exploit replication and concurrent execution. The META Project is reexamining high-level mechanisms such as the filesystem, shell language, and administration tools in distributed systems

    Checkpointing as a Service in Heterogeneous Cloud Environments

    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another. Comment: 20 pages, 11 figures, appears in CCGrid, 201

    Group Communication in Amoeba and its Applications

    Unlike many other operating systems, Amoeba is a distributed operating system that provides group communication (i.e., one-to-many communication). We wil

    ISIS and META projects

    The ISIS project has developed a new methodology, virtual synchrony, for writing robust distributed software. High-performance multicast, large-scale applications, and wide-area networks are the focus of interest. Several interesting applications that exploit the strengths of ISIS, including an NFS-compatible replicated file system, are being developed. The META project addresses distributed control in a soft real-time environment incorporating feedback. This domain encompasses examples as diverse as monitoring inventory and consumption on a factory floor, and performing load-balancing on a distributed computing system. One of the first uses of META is for distributed application management: the tasks of configuring a distributed program, dynamically adapting to failures, and monitoring its performance. Recent progress and current plans are reported

    The ISIS Project: Real Experience with a Fault Tolerant Programming System

    The ISIS project has developed a distributed programming toolkit and a collection of higher-level applications based on these tools. ISIS is now in use at more than 300 locations worldwide. The lessons (and surprises) gained from this experience with the real world are discussed

    Intelligent architecture for automatic resource allocation in computer clusters

    As the need for more reporting and assessment of information increases exponentially, computer-based applications consume resources at an alarmingly rapid rate. Therefore, traditional techniques for managing resource allocation, topology and systems need urgent revision. In this paper, we present an intelligent architecture that introduces a new strategy for managing resource discovery, allocation and dynamic reconfiguration at run-time. Our building methodology involves the employment of new types of clustered systems based on large application groupings, each having a master cluster controller. Each controlling engine consists of self-healing intelligent entities that can compensate for a variety of software or hardware problems. We also present evaluation results of extensive experiments in a production environment, which demonstrate the advantages of our approach