Search CORE

19,288 research outputs found

A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication venue
Publication date: 31/12/2004
Field of study

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques.Comment: 11 page

arXiv.org e-Print Archive

CiteSeerX

SEARS: Space Efficient And Reliable Storage System in the Cloud

Author: Guo Katherine
Li Ying
Soljanin Emina
Wang Xin
Woo Thomas
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/08/2015
Field of study

Today's cloud storage services must offer storage reliability and fast data retrieval for large amount of data without sacrificing storage cost. We present SEARS, a cloud-based storage system which integrates erasure coding and data deduplication to support efficient and reliable data storage with fast user response time. With proper association of data to storage server clusters, SEARS provides flexible mixing of different configurations, suitable for real-time and archival applications. Our prototype implementation of SEARS over Amazon EC2 shows that it outperforms existing storage systems in storage efficiency and file retrieval time. For 3 MB files, SEARS delivers retrieval time of

2.5

s compared to

7

s with existing systems.Comment: 4 pages, IEEE LCN 201

arXiv.org e-Print Archive

Crossref

Secure pseudo-random linear binary sequences generators based on arithmetic polynoms

Author: A Omondi
B Schneier
B Yang
HL Garner
JA Hetagurov
JM Ortega
MA Soderstrand
OA Finko
OA Finko
VA Krasnobaev
VP Shmerko
WK Jenkins
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 08/09/2014
Field of study

We present a new approach to constructing of pseudo-random binary sequences (PRS) generators for the purpose of cryptographic data protection, secured from the perpetrator's attacks, caused by generation of masses of hardware errors and faults. The new method is based on use of linear polynomial arithmetic for the realization of systems of boolean characteristic functions of PRS' generators. "Arithmetizatio" of systems of logic formulas has allowed to apply mathematical apparatus of residue systems for multisequencing of the process of PRS generation and organizing control of computing errors, caused by hardware faults. This has guaranteed high security of PRS generator's functioning and, consequently, security of tools for cryptographic data protection based on those PRSs

arXiv.org e-Print Archive

Crossref

Fault-tolerant sub-lithographic design with rollback recovery

Author: DeHon André
Naeimi Helia
Publication venue: 'AIP Publishing'
Publication date: 19/03/2008
Field of study

Shrinking feature sizes and energy levels coupled with high clock rates and decreasing node capacitance lead us into a regime where transient errors in logic cannot be ignored. Consequently, several recent studies have focused on feed-forward spatial redundancy techniques to combat these high transient fault rates. To complement these studies, we analyze fine-grained rollback techniques and show that they can offer lower spatial redundancy factors with no significant impact on system performance for fault rates up to one fault per device per ten million cycles of operation (Pf = 10^-7) in systems with 10^12 susceptible devices. Further, we concretely demonstrate these claims on nanowire-based programmable logic arrays. Despite expensive rollback buffers and general-purpose, conservative analysis, we show the area overhead factor of our technique is roughly an order of magnitude lower than a gate level feed-forward redundancy scheme

Caltech Authors

ScholarlyCommons@Penn

Study of a unified hardware and software fault-tolerant architecture

Author: Adams Stuart
Alger Linda
Friend Steven
Greeley Gregory
Lala Jaynarayan
Sacco Stephen
Publication venue
Publication date
Field of study

A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified

NASA Technical Reports Server

Introduction to the special section on dependable network computing

Author: Avresky D. R.
Bruck Jehoshua
Culler David E.
Publication venue
Publication date: 01/02/2001
Field of study

Dependable network computing is becoming a key part of our daily economic and social life. Every day, millions of users and businesses are utilizing the Internet infrastructure for real-time electronic commerce transactions, scheduling important events, and building relationships. While network traffic and the number of users are rapidly growing, the mean-time between failures (MTTF) is surprisingly short; according to recent studies, in the majority of Internet backbone paths, the MTTF is 28 days. This leads to a strong requirement for highly dependable networks, servers, and software systems. The challenge is to build interconnected systems, based on available technology, that are inexpensive, accessible, scalable, and dependable. This special section provides insights into a number of these exciting challenges

Caltech Authors