Computing in the RAIN: a reliable array of independent nodes
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.
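To make the storage idea concrete, the Python sketch below shows a minimal erasure-coded stripe layout. It is not the RAIN project's actual code construction; it uses a single XOR parity stripe, so the loss of any one storage node can be repaired from the survivors. The function names and the parameter k are illustrative assumptions.

import functools

# Illustrative sketch only, not the RAIN project's actual error-control code:
# data is striped across k data nodes plus one XOR parity node, so the loss
# of any single node can be repaired from the survivors.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal-length stripes and append an XOR parity stripe."""
    stripe_len = -(-len(data) // k)                      # ceiling division
    padded = data.ljust(k * stripe_len, b"\0")
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(k)]
    return stripes + [functools.reduce(xor_bytes, stripes)]

def recover(stripes):
    """Rebuild at most one missing stripe (data or parity) from the others."""
    missing = [i for i, s in enumerate(stripes) if s is None]
    if len(missing) > 1:
        raise ValueError("a single parity stripe tolerates only one lost node")
    if missing:
        survivors = [s for s in stripes if s is not None]
        stripes[missing[0]] = functools.reduce(xor_bytes, survivors)
    return stripes

if __name__ == "__main__":
    shares = encode(b"spaceborne telemetry block", k=4)
    shares[2] = None                                     # simulate a failed node
    restored = b"".join(recover(shares)[:-1]).rstrip(b"\0")
    assert restored == b"spaceborne telemetry block"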
Building on Quicksand
Reliable systems have always been built out of unreliable components. Early on, the reliable components were small, such as mirrored disks or ECC (Error Correcting Codes) in core memory. These systems were designed such that failures of these small components were transparent to the application. Later, the size of the unreliable components grew larger and semantic challenges crept into the application when failures occurred.

As the granularity of the unreliable component grows, the latency to communicate with a backup becomes unpalatable. This leads to a more relaxed model for fault tolerance: the primary system acknowledges the work request and its actions without waiting to ensure that the backup is notified of the work. This improves the responsiveness of the system.

There are two implications of asynchronous state capture: 1) Everything promised by the primary is probabilistic. There is always a chance that an untimely failure shortly after the promise results in a backup proceeding without knowledge of the commitment. Hence, nothing is guaranteed! 2) Applications must ensure eventual consistency. Since work may be stuck in the primary after a failure and reappear later, the processing order for work cannot be guaranteed.

Platform designers are struggling to make this easier for their applications. Emerging patterns of eventual consistency and probabilistic execution may soon yield a way for applications to express requirements for a "looser" form of consistency while providing availability in the face of ever larger failures.

This paper recounts portions of the evolution of these trends, attempts to show the patterns that span these changes, and talks about future directions as we continue to "build on quicksand".
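As a rough illustration of the window the abstract describes, the toy Python sketch below has a primary acknowledge a request before the backup has applied it; a crash inside that window would lose the promise. The class and method names (Primary, Backup, submit) are hypothetical and not taken from the paper.

import queue
import threading
import time

class Backup:
    def __init__(self):
        self.state = []

    def apply(self, op):
        time.sleep(0.05)           # stand-in for replication latency
        self.state.append(op)

class Primary:
    def __init__(self, backup):
        self.state = []
        self._outbox = queue.Queue()
        self._backup = backup
        threading.Thread(target=self._replicate, daemon=True).start()

    def submit(self, op):
        self.state.append(op)      # apply locally
        self._outbox.put(op)       # replicate asynchronously
        return "ACK"               # promise made before the backup knows

    def _replicate(self):
        while True:
            self._backup.apply(self._outbox.get())

if __name__ == "__main__":
    backup = Backup()
    primary = Primary(backup)
    print(primary.submit("debit $10"))                    # client sees ACK immediately
    print("backup state right after ACK:", backup.state)  # likely still empty
    time.sleep(0.2)
    print("backup state after catch-up:", backup.state)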
Large-scale File System Design and Architecture
This paper deals with design issues of a global file system, aiming to provide transparent data availability, security against loss and disclosure, and support for mobile and disconnected clients. First, the paper surveys general challenges and requirements for large-scale file systems; then the design of particular elementary parts of the proposed file system is presented. This includes the design of the raw system architecture, dynamic file replication with appropriate data consistency, file location, and data security. Our proposed system is called Gaston and will be referred to further in the text by this name or its abbreviation GFS (Gaston File System).
AR2T: implementing a truly SRAM-based FPGA on-line concurrent testing
The new partial and dynamic reconfiguration features offered by new generations of SRAM-based FPGAs may be used to improve the dependability of reconfigurable hardware platforms through the implementation of on-line concurrent testing / fault tolerance mechanisms. However, such mechanisms imply the existence of new test strategies that do not interfere with the current system functionality. The AR2T (Active Replication and Release for Testing) technique is a set of procedures that enables the implementation of a truly non-intrusive structural on-line concurrent testing approach, detecting and avoiding permanent faults and correcting errors due to transient faults. Experimental results prove the effectiveness of these solutions. In relation to a previous technique proposed by the authors as part of the DRAFT FPGA concurrent test methodology, AR2T extends the range of circuits that can be replicated by introducing a small replication aid block.
Quality assessment technique for ubiquitous software and middleware
The new paradigm of computing and information systems is ubiquitous computing. The technology-oriented issues of ubiquitous computing systems have led researchers to pay much attention to feasibility studies of the technologies rather than to building quality assurance indices or guidelines. In this context, measuring quality is the key to developing high-quality ubiquitous computing products. For this reason, various quality models have been defined, adopted, and enhanced over the years; for example, the recognised standard quality model ISO/IEC 9126 is the result of a consensus on a software quality model with three levels: characteristics, sub-characteristics, and metrics. However, it is unlikely that this scheme will be directly applicable to ubiquitous computing environments, which differ considerably from conventional software; this raises significant concern about reformulating existing methods and, especially, elaborating new assessment techniques for ubiquitous computing environments. This paper selects appropriate quality characteristics for the ubiquitous computing environment, which can be used as the quality target for both ubiquitous computing product evaluation processes and development processes. Further, each of the quality characteristics has been expanded with evaluation questions and metrics, in some cases with measures. In addition, this quality model has been applied to an industrial setting of the ubiquitous computing environment. These applications revealed that, while the approach is sound, some parts need further development in the future.
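One possible way to represent the three-level structure mentioned above (characteristics, sub-characteristics, metrics), with each characteristic expanded by evaluation questions, is sketched below in Python; the concrete names are placeholders, not the characteristics actually selected in the paper.

from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    unit: str

@dataclass
class SubCharacteristic:
    name: str
    questions: list = field(default_factory=list)   # evaluation questions
    metrics: list = field(default_factory=list)     # Metric instances

@dataclass
class Characteristic:
    name: str
    sub: list = field(default_factory=list)

# Placeholder example of one expanded characteristic.
reliability = Characteristic(
    name="reliability",
    sub=[SubCharacteristic(
        name="fault tolerance",
        questions=["Does the service keep running when a sensor node drops out?"],
        metrics=[Metric("mean time to recovery", "seconds")],
    )],
)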
Fault-tolerant Distributed Reactive Programming
In this paper, we present a holistic approach to provide fault tolerance for distributed reactive programming. Our solution automatically stores and recovers program state to handle crashes, automatically updates and shares distributed parts of the state to provide eventual consistency, and handles errors in a fine-grained manner to allow precise manual control when necessary. By making use of the reactive programming paradigm, we provide these mechanisms without changing the behavior of existing programs and with reasonable performance, as indicated by our experimental evaluation.
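A minimal sketch of the store-and-recover idea, assuming a file-backed signal: every update is persisted before it propagates to observers, and a restarted process reconstructs the last committed value. This is a simplified stand-in, not the authors' implementation; the class name, API, and JSON snapshot format are assumptions.

import json
import os

class PersistentSignal:
    """Reactive value that snapshots its state to disk on every update."""

    def __init__(self, name, initial, store_dir="."):
        self._path = os.path.join(store_dir, f"{name}.json")
        self._observers = []
        if os.path.exists(self._path):            # recover after a crash
            with open(self._path) as f:
                self._value = json.load(f)
        else:
            self._value = initial
            self._persist()

    def _persist(self):
        with open(self._path, "w") as f:
            json.dump(self._value, f)

    def observe(self, callback):
        self._observers.append(callback)

    def set(self, value):
        self._value = value
        self._persist()                           # store before propagating
        for cb in self._observers:
            cb(value)

    @property
    def value(self):
        return self._value

if __name__ == "__main__":
    temperature = PersistentSignal("temperature", initial=20)
    temperature.observe(lambda v: print("derived view sees", v))
    temperature.set(23)   # after a restart, PersistentSignal("temperature", 0)
                          # would come back with value 23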