1,217 research outputs found

    Total order broadcast for fault tolerant exascale systems

    Full text link
    In the process of designing a new fault tolerant run-time for future exascale systems, we discovered that a total order broadcast would be necessary. That is, nodes of a supercomputer should be able to broadcast messages to other nodes even in the face of failures. All messages should be seen in the same order at all nodes. While this is a well studied problem in distributed systems, few researchers have looked at how to perform total order broadcasts at large scales for data availability. Our experience implementing a published total order broadcast algorithm showed poor scalability at tens of nodes. In this paper we present a novel algorithm for total order broadcast which scales logarithmically in the number of processes and is not delayed by most process failures. While we are motivated by the needs of our run-time we believe this primitive is of general applicability. Total order broadcasts are used often in datacenter environments and as HPC developers begins to address fault tolerance at the application level we believe they will need similar primitives

    The Isis project: Fault-tolerance in large distributed systems

    Get PDF
    This final status report covers activities of the Isis project during the first half of 1992. During the report period, the Isis effort has achieved a major milestone in its effort to redesign and reimplement the Isis system using Mach and Chorus as target operating system environments. In addition, we completed a number of publications that address issues raised in our prior work; some of these have recently appeared in print, while others are now being considered for publication in a variety of journals and conferences

    Combining high performance and fault tolerance in a distributed file server

    Get PDF
    Among the most reliable and fault tolerant components in a distributed system are storage systems. Obviously, reliability of storage systems belongs to the most researched issues in distributed computing. Every distributed file system project is based on different assumptions about size, load, amount of sharing, and desirable semantics, making it hard to compare research results fairly. The current Amoeba file server is the Bullet File Server [van Renesse, Tanenbaum, and Wilschut, 1989] which provides immutable files, is optimized for whole-file transfer and does caching at the file server. It has excellent performance for reading cached files (1.5 + 1.5 n ms for n kilobytes) and for sustained file I/O (680 kilobytes per second, both on read and write). Although performance is excellent, there is room for improvement, especially in the area of fault tolerance, sharing semantics and caching. I am currently doing the back-of-the-envelope design for a new file server that will form the basis of both our normal file system and of a complex-object server which is being designed by the database group at CWI. In addition to those desirable properties of fault tolerance, persistency, consistency, and availability, I am anxious to achieve even better performance than the Bullet server by extensive use of client and server caching.This position paper presents some of our design ideas. Note that this is work in progress; that

    ISIS and META projects

    Get PDF
    The ISIS project has developed a new methodology, virtual synchony, for writing robust distributed software. High performance multicast, large scale applications, and wide area networks are the focus of interest. Several interesting applications that exploit the strengths of ISIS, including an NFS-compatible replicated file system, are being developed. The META project is distributed control in a soft real-time environment incorporating feedback. This domain encompasses examples as diverse as monitoring inventory and consumption on a factory floor, and performing load-balancing on a distributed computing system. One of the first uses of META is for distributed application management: the tasks of configuring a distributed program, dynamically adapting to failures, and monitoring its performance. Recent progress and current plans are reported

    A Survey of Green Networking Research

    Full text link
    Reduction of unnecessary energy consumption is becoming a major concern in wired networking, because of the potential economical benefits and of its expected environmental impact. These issues, usually referred to as "green networking", relate to embedding energy-awareness in the design, in the devices and in the protocols of networks. In this work, we first formulate a more precise definition of the "green" attribute. We furthermore identify a few paradigms that are the key enablers of energy-aware networking research. We then overview the current state of the art and provide a taxonomy of the relevant work, with a special focus on wired networking. At a high level, we identify four branches of green networking research that stem from different observations on the root causes of energy waste, namely (i) Adaptive Link Rate, (ii) Interface proxying, (iii) Energy-aware infrastructures and (iv) Energy-aware applications. In this work, we do not only explore specific proposals pertaining to each of the above branches, but also offer a perspective for research.Comment: Index Terms: Green Networking; Wired Networks; Adaptive Link Rate; Interface Proxying; Energy-aware Infrastructures; Energy-aware Applications. 18 pages, 6 figures, 2 table

    A survey of general-purpose experiment management tools for distributed systems

    Get PDF
    International audienceIn the field of large-scale distributed systems, experimentation is particularly difficult. The studied systems are complex, often nondeterministic and unreliable, software is plagued with bugs, whereas the experiment workflows are unclear and hard to reproduce. These obstacles led many independent researchers to design tools to control their experiments, boost productivity and improve quality of scientific results. Despite much research in the domain of distributed systems experiment management, the current fragmentation of efforts asks for a general analysis. We therefore propose to build a framework to uncover missing functionality of these tools, enable meaningful comparisons be-tween them and find recommendations for future improvements and research. The contribution in this paper is twofold. First, we provide an extensive list of features offered by general-purpose experiment management tools dedicated to distributed systems research on real platforms. We then use it to assess existing solutions and compare them, outlining possible future paths for improvements
    corecore