10 research outputs found

    Software fault tolerance in computer operating systems

    This chapter provides data on, and analysis of, the dependability and fault tolerance of three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which causes the backup execution (the processor state and the sequence of events) to differ from the original execution, is a major reason for the measured software fault tolerance. The fault tolerance of the IBM/MVS system almost doubles when recovery routines are provided, compared with the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
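
    The process-pair mechanism described above can be sketched in miniature: a primary process checkpoints its progress to a backup, and on primary failure the backup resumes from the last checkpoint. The sketch below is a hypothetical Python analogue, not the Tandem/GUARDIAN implementation; the fault model (a fault that does not recur on the backup, standing in for the differing processor state and event sequence) is an assumption for illustration.

```python
import multiprocessing as mp
import queue

def primary(items, ckpt_q, fail_at):
    """Process the work items, checkpointing after each completed one."""
    for i, item in enumerate(items):
        if i == fail_at:
            raise RuntimeError("latent software fault triggered")
        ckpt_q.put(i)  # checkpoint: index of the last completed item

def run_pair(items):
    """Primary/backup pair: restart from the last checkpoint on failure."""
    ckpt_q = mp.Queue()
    p = mp.Process(target=primary, args=(items, ckpt_q, 2))
    p.start()
    p.join()
    last = -1
    while True:  # drain the checkpoints the primary managed to send
        try:
            last = ckpt_q.get(timeout=0.2)
        except queue.Empty:
            break
    if p.exitcode != 0:
        # The backup resumes after the last checkpoint. Because its
        # process state and event timing differ from the primary's, a
        # state-dependent fault (modelled here as simply not recurring,
        # hence fail_at=-1) often does not trigger again.
        b = mp.Process(target=primary, args=(items[last + 1:], ckpt_q, -1))
        b.start()
        b.join()
        print("backup finished the remaining work")

if __name__ == "__main__":
    run_pair(list(range(5)))
```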

    Experimental analysis of computer system dependability

    This paper reviews an area that has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for the three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: the design phase, the prototype phase, and the operational phase. Before these phases are discussed, several statistical techniques used in the area are introduced, including the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics and of representative studies. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools, including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion addresses several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
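
    As an illustration of one technique the survey introduces, importance sampling accelerates Monte Carlo estimation of rare failure events by sampling from a distribution under which the rare event is common and reweighting each hit by the likelihood ratio. The sketch below is a minimal textbook example; the exponential lifetime model and all numbers are assumptions for illustration, not taken from the paper.

```python
import math
import random

def p_fail_naive(lam, t, n):
    """Plain Monte Carlo estimate of P(X > t) for X ~ Exp(lam); needs
    on the order of 1/P(X > t) samples to see the event even once."""
    return sum(random.expovariate(lam) > t for _ in range(n)) / n

def p_fail_is(lam, mu, t, n):
    """Importance sampling: draw from Exp(mu) with mu < lam so the rare
    tail event X > t occurs often, then reweight each hit by the
    likelihood ratio f(x)/g(x) = (lam/mu) * exp((mu - lam) * x)."""
    total = 0.0
    for _ in range(n):
        x = random.expovariate(mu)
        if x > t:
            total += (lam / mu) * math.exp((mu - lam) * x)
    return total / n

# Exact answer exp(-lam*t) = exp(-10) ~ 4.5e-5: the naive estimator
# usually returns 0.0 with n = 10000, while the IS estimate is close.
print(p_fail_naive(1.0, 10.0, 10000))
print(p_fail_is(1.0, 0.1, 10.0, 10000), "vs", math.exp(-10.0))
```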

    Measurement and Analysis of Operating System Fault Tolerance

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. ONR / N00014-91-J-1116; NASA / NAG-1-61.

    A real-time diagnostic and performance monitor for UNIX

    There are now over one million UNIX sites, and the pace at which new installations are added is steadily increasing. With this increase comes a need for simple, efficient, effective, and adaptable ways of simultaneously collecting real-time diagnostic and performance data. This need exists because distributed systems can give rise to complex failure situations that are often unidentifiable with single-machine diagnostic software. The simultaneous collection of error and performance data is also important for research in failure prediction and error/performance studies. This paper introduces a portable method to concurrently collect real-time diagnostic and performance data on a distributed UNIX system. The combined diagnostic/performance data collection is implemented on a distributed multicomputer system using SUN4s as servers. The approach uses existing UNIX system facilities to gather system dependability information such as error and crash reports. In addition, performance data such as CPU utilization, disk usage, I/O transfer rate, and network contention are also collected. In the future, the collected data will be used to identify dependability bottlenecks and to analyze the impact of failures on system performance.
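
    A minimal sketch of this kind of combined collection, in the spirit of the approach described: periodically sample performance figures from stock UNIX utilities alongside recent error reports from the system log. The specific commands, the log path, and the sampling interval are assumptions for illustration, not the paper's tool.

```python
import datetime
import subprocess
import time

def snapshot():
    """One combined diagnostic/performance sample from stock UNIX tools."""
    now = datetime.datetime.now().isoformat(timespec="seconds")
    load = subprocess.run(["uptime"], capture_output=True,
                          text=True).stdout.strip()
    disk = subprocess.run(["df", "-h", "/"], capture_output=True,
                          text=True).stdout.splitlines()[-1]
    try:  # recent error reports; the log path varies by system (assumption)
        with open("/var/log/syslog") as f:
            errors = [l for l in f.readlines()[-200:] if "error" in l.lower()]
    except (FileNotFoundError, PermissionError):
        errors = []
    return {"time": now, "load": load, "disk": disk,
            "recent_errors": len(errors)}

if __name__ == "__main__":
    while True:
        print(snapshot())  # in practice, append to a log for later analysis
        time.sleep(60)     # one combined sample per minute
```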

    Log-based software monitoring: a systematic mapping study

    Modern software development and operations rely on monitoring to understand how systems behave in production. The data provided by application logs and the runtime environment are essential to detect and diagnose undesired behavior and improve system reliability. However, despite the rich ecosystem of industry-ready log solutions, monitoring complex systems and getting insights from log data remain a challenge. Researchers and practitioners have been actively working to address several challenges related to logs, e.g., how to provide better tooling support for developers' logging decisions, how to effectively process and store log data, and how to extract insights from log data. A holistic view of the research effort on logging practices and automated log analysis is key to providing directions and disseminating the state of the art for technology transfer. In this paper, we study 108 papers (72 research track papers, 24 journal papers, and 12 industry track papers) from different communities (e.g., machine learning, software engineering, and systems) and structure the research field in light of the life cycle of log data. Our analysis shows that (1) logging is challenging not only in open-source projects but also in industry, (2) machine learning is a promising approach to enable contextual analysis of source code for log recommendation, but further investigation is required to assess the usability of those tools in practice, (3) few studies have approached efficient persistence of log data, and (4) there are open opportunities to analyze application logs and to evaluate state-of-the-art log analysis techniques in a DevOps context.
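
    A common first step in the automated log analysis pipelines this study maps is template extraction: masking variable fields so that log lines differing only in parameters collapse into one event type. The sketch below is a simplified illustration of that idea; the regular expressions and sample lines are assumptions, not one of the surveyed tools.

```python
import re
from collections import Counter

def template(line):
    """Mask variable fields (IPs, hex ids, numbers) so lines that differ
    only in parameters collapse to one template."""
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<IP>", line)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

logs = [
    "connection from 10.0.0.5 port 51234",
    "connection from 10.0.0.9 port 40022",
    "worker 7 restarted after error 0xdeadbeef",
]
# Count occurrences per template: the frequent templates describe normal
# behavior; rare ones are candidates for anomaly inspection.
print(Counter(template(l) for l in logs).most_common())
```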

    First International Conference on Ada (R) Programming Language Applications for the NASA Space Station, volume 2

    Topics discussed include: reusability; mission-critical issues; run time; expert systems; language issues; life-cycle issues; software tools; and computers for Ada.

    Space and Earth Sciences, Computer Systems, and Scientific Data Analysis Support, Volume 1

    This Final Progress Report covers the specific technical activities of Hughes STX Corporation for the last contract triannual period, 1 June through 30 September 1993, in support of assigned task activities at Goddard Space Flight Center (GSFC). It also provides a brief summary of work throughout the contract period of performance on each active task. Technical activity is presented in Volume 1, while financial and level-of-effort data are presented in Volume 2. Technical support was provided to all Divisions and Laboratories of Goddard's Space Sciences and Earth Sciences Directorates. Types of support include: scientific programming, systems programming, computer management, mission planning, scientific investigation, data analysis, data processing, database creation and maintenance, instrumentation development, and management services. Missions and instruments supported include: ROSAT, Astro-D, BBXRT, XTE, AXAF, GRO, COBE, WIND, UIT, SMM, STIS, HEIDI, DE, URAP, CRRES, Voyagers, ISEE, San Marco, LAGEOS, TOPEX/Poseidon, Pioneer-Venus, Galileo, Cassini, Nimbus-7/TOMS, Meteor-3/TOMS, FIFE, BOREAS, TRMM, AVHRR, and Landsat. Accomplishments include: development of computing programs for mission science and data analysis, supercomputer applications support, computer network support, computational upgrades for data archival and analysis centers, end-to-end management of mission data flow, scientific modeling and results in the fields of space and Earth physics, planning and design of the GSFC V0 DAAC and V0 IMS, fabrication, assembly, and testing of mission instrumentation, and design of a mission operations center.

    Characterization of operating system dependability in the presence of faulty device drivers

    Device drivers now make up an essential part of operating systems, and several studies show that they are frequently at the root of operating system failures. In this thesis, we present a method for evaluating the robustness of kernels in the face of anomalous device driver behavior. To this end, after analyzing and specifying the characteristics of the exchanges between drivers and the kernel (the DPI, Driver Programming Interface), we propose an original fault injection technique based on corrupting the parameters of the functions invoked at this interface. We define several approaches for analyzing and interpreting the observed results in order to obtain objective dependability measures. These measures allow different points of view to be taken into account so as to meet the actual needs of the user. Finally, we illustrate and validate the applicability of this method by deploying it in an experimental environment under Linux. The proposed method contributes to characterizing kernel dependability with respect to failures of the system's drivers. The impact of the results is twofold: (a) it allows the developer of such software to identify potential weaknesses affecting the system's dependability, and (b) it helps an integrator choose the component best suited to their needs.
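
    The parameter-corruption idea at an interface boundary can be sketched generically: wrap an interface function, flip a bit in its integer arguments, and observe whether the callee detects the corrupted call. The sketch below is a hypothetical user-space Python analogue of instrumenting the DPI, not the thesis's Linux implementation; the wrapped function and its parameter check are assumptions for illustration.

```python
import functools
import random

def corrupt_params(func, bit=None):
    """Wrap an interface function and flip one bit in each integer
    argument -- a user-space analogue of corrupting parameters at the
    driver/kernel interface to probe the callee's robustness."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        pos = bit if bit is not None else random.randrange(16)
        mutated = [a ^ (1 << pos) if isinstance(a, int) else a
                   for a in args]
        return func(*mutated, **kwargs)
    return wrapper

def read_block(device_id, length):
    """Stand-in for a kernel service invoked by a driver (hypothetical)."""
    if length < 0 or length > 4096:
        raise ValueError("invalid length rejected")  # robust kernel path
    return bytes(length)

faulty_read = corrupt_params(read_block, bit=12)
try:
    faulty_read(3, 100)  # 100 ^ (1 << 12) = 4196, caught by the check
except ValueError as e:
    print("injected fault detected:", e)
```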

    Third International Symposium on Space Mission Operations and Ground Data Systems, part 1

    Under the theme of 'Opportunities in Ground Data Systems for High Efficiency Operations of Space Missions,' the SpaceOps '94 symposium included presentations of more than 150 technical papers spanning five topic areas: Mission Management, Operations, Data Management, System Development, and Systems Engineering. The papers focus on improvements in the efficiency, effectiveness, productivity, and quality of data acquisition, ground systems, and mission operations. New technology, techniques, methods, and human systems are discussed. Accomplishments are also reported in the application of information systems to improve data retrieval, reporting, and archiving; the management of human factors; the use of telescience and teleoperations; and the design and implementation of logistics support for mission operations.