
    Fault and Error Latency Under Real Workload: an Experimental Study

    A practical methodology for the study of fault and error latency is demonstrated under a real workload. This is the first study to measure and quantify latency under a real workload, filling a major gap in the current understanding of workload-failure relationships. The methodology is based on low-level data gathered on a VAX 11/780 during the normal workload conditions of the installation. Fault occurrence is simulated on the data, and the error generation and discovery process is reconstructed to determine latency. The analysis then combines the low-level activity data with high-level machine performance data to yield a better understanding of the phenomena. A strong relationship exists between latency and workload, and that relationship is quantified. The sampling and reconstruction techniques used are also validated. Error latency in the memory where the operating system resides was studied using data on physical memory accesses. Fault latency in the paged section of memory was determined using data from physical memory scans. Error latency in the microcontrol store was studied using data on microcode access and usage.
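    The reconstruction step described above can be made concrete with a small sketch. The Python fragment below is illustrative only, not the instrumentation used in the study: it assumes a time-ordered trace of physical memory accesses, injects a fault at a chosen time and address, and measures error latency as the interval until the first subsequent read of that address, with a write assumed to overwrite the error before it is observed. All names (Access, error_latency, sample_latencies) are hypothetical.

        import random
        from dataclasses import dataclass

        @dataclass
        class Access:
            time: float      # seconds into the trace
            address: int     # physical memory word address
            op: str          # "read" or "write"

        def error_latency(trace, fault_time, fault_address):
            # Walk the time-ordered trace: a write to the faulted word
            # overwrites the error before anyone sees it; the first read
            # after the fault discovers it.
            for a in trace:
                if a.time < fault_time or a.address != fault_address:
                    continue
                if a.op == "write":
                    return None                 # error overwritten, never observed
                return a.time - fault_time      # discovered by this read
            return None                         # trace ended before discovery

        def sample_latencies(trace, n_faults=1000, seed=0):
            # Inject faults at random times/addresses and collect latencies.
            rng = random.Random(seed)
            t_min, t_max = trace[0].time, trace[-1].time
            addresses = sorted({a.address for a in trace})
            latencies = []
            for _ in range(n_faults):
                lat = error_latency(trace, rng.uniform(t_min, t_max),
                                    rng.choice(addresses))
                if lat is not None:
                    latencies.append(lat)
            return latencies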

    The Effect of System Workload on Error Latency: An Experimental Study

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory.
    Joint Services Electronics Program / N00014-84-C-0149
    Graduate Research Board, University of Illinois at Urbana-Champaign

    Fault and Error Latency Under Real Workload - An Experimental Study

    Joint Services Electronics Program (JSEP) / N00014-84-C-0149
    National Aeronautics and Space Administration (NASA) / NAG-1-613
    Graduate Research Board, U of I (GRB)

    Innovative Idea Category: Software Probes and a Self-Testing System for Failure Detection and Diagnosis

    A key problem in today's complex software systems is software failure detection and isolation. Most software failures are only partial, and if efficiently diagnosed, isolated, and recovered from, they can avert a total outage. The probe detects failed software components in a running software system by requesting service, or a certain level of service, from a set of functions, modules, and/or subsystems (the target) and checking the response to the request. The objective is to localize the failure only up to the level of a target, while achieving a high degree of efficiency and confidence in the process. Targets can be identified at different levels or layers in the software; the choice is based on the granularity of fault detection desired, considered together with the level at which recovery can be implemented. The implementation of the probe system is made self-testing against any single failure in its operational components, using the idea of a null probe. The probe system has been designed to take advantage of the latency characteristics of errors to provide a low-overhead mechanism. The ideas are implementable in either a single or multiple computer system.
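    As a rough illustration of the probe idea, the sketch below (hypothetical names, not the authors' implementation) models a probe as a service request paired with a response check, and uses a null probe that exercises only the probe machinery, so a failing null probe implicates the probe system rather than a target.

        from typing import Any, Callable

        class Probe:
            # A probe pairs a service request to a target with a check of
            # the response; an exception or a bad response marks the target
            # as failed.
            def __init__(self, name: str, request: Callable[[], Any],
                         check: Callable[[Any], bool]):
                self.name = name
                self.request = request
                self.check = check

            def run(self) -> bool:
                try:
                    return self.check(self.request())
                except Exception:
                    return False    # no response / crash counts as a failure

        # The null probe exercises only the probe machinery itself; if it
        # fails, the probe system (not a target) is suspect.
        null_probe = Probe("null", request=lambda: "ok", check=lambda r: r == "ok")

        def diagnose(probes):
            # Run the null probe first, then report the failed targets.
            if not null_probe.run():
                return ["probe-system"]
            return [p.name for p in probes if not p.run()]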

    Fault and Error Latency Under Real Workload - an Experimental Study

    99 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 1986.
    This thesis demonstrates a practical methodology for the study of fault and error latency under real workload. This is the first study that measures and quantifies latency under real workload and fills a major gap in the current understanding of workload-failure relationships. The methodology is based on low-level data gathered on a VAX 11/780 during the normal workload conditions of the installation. Fault occurrence is simulated on the data, and the error generation and discovery process is reconstructed to determine latency. The analysis proceeds to combine the low-level activity data with high-level machine performance data to yield a better understanding of the phenomenon. This study finds a strong relationship between latency and workload and quantifies that relationship. The sampling and reconstruction techniques used are also validated.
    Error latency in the memory where the operating system resides is studied using data on physical memory accesses. These data are gathered through hardware probes in the machine that sample the system during the normal workload cycle of the installation. The technique provides a means to study the system under different workloads and for multiple days. These data are used to reconstruct the error discovery process in the system. An approach to determine the fault miss percentage is developed, and a verification of the entire methodology is also performed. This study finds that the mean error latency, in the memory containing the operating system, varies by a factor of 10 to 1 (in hours) between the low and high workloads. It is also found that of all errors occurring within a day, 70% are detected in the same day, 82% within the following day, and 91% within the third day.
    Fault latency in the paged sections of memory is determined using data from physical memory scans. Fault latency distributions are generated for s-a-0 and s-a-1 permanent fault models. Results show that the mean fault latency of an s-a-0 fault is nearly 5 times that of an s-a-1 fault. Performance data gathered on the machine are used to study workload-latency behavior. An analysis-of-variance model to quantify the relative influence of various workload measures on the evaluated latency is also given.
    Error latency in the microcontrol store is studied using data on microcode access and usage. These data are acquired using probes in the microsequencer of the CPU. It is found that the latency distribution has a large mode between 50 and 100 microcycles and two additional smaller modes. It is interesting to note that the error latency distribution in the microcontrol store is not exponential, as suggested by other reported research.
    U of I Only: Restricted to the U of I community indefinitely during batch ingest of legacy ETD.
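    The stuck-at fault-latency measurement lends itself to a brief sketch. The Python below is an assumption-laden illustration, not the thesis instrumentation: it treats periodic memory scans of a single bit as (time, value) samples and takes the fault latency of an s-a-0 or s-a-1 fault to be the gap between the fault time and the first scan in which the bit should hold the opposite of the stuck value. The example also suggests why, in memory dominated by zeros, s-a-0 faults tend to stay latent longer than s-a-1 faults; the function name and sample data are hypothetical.

        def fault_latency(scans, fault_time, stuck_value):
            # scans: (time, bit_value) samples of one memory bit, time-ordered.
            # The fault becomes an error only when the bit should hold the
            # opposite of the stuck value; latency is the gap from fault_time
            # to the first scan where that happens, or None if it never does.
            for t, value in scans:
                if t >= fault_time and value != stuck_value:
                    return t - fault_time
            return None

        # A bit that stays 0 for a while masks an s-a-0 fault but exposes
        # an s-a-1 fault almost immediately.
        scans = [(0, 0), (60, 0), (120, 1), (180, 1)]
        print(fault_latency(scans, fault_time=10, stuck_value=0))   # 110
        print(fault_latency(scans, fault_time=10, stuck_value=1))   # 50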

    Challenges facing Software Fault-tolerance

    As software dominates most discussions in the information technology business, one needs to carefully examine where we are headed in software dependability. This paper re-examines some of the basic premises upon which the area of software fault-tolerance is built and critiques some current practices and beliefs. A few of the thoughts and contributions are: The definition of a software failure needs to change from a specification-based notion to one of customer expectation and ability to do productive work. This will cause a significant shift in what we build fault-tolerance for. However, it would also help narrow the gap between today's theory, practice, and customer need. Data on customer problems illustrate that 90% of the problems reported are what we have traditionally considered non-defect, implying no need for a programming change. However, with the new definition of failure, we will need to address this more seriously as a part of fault-tolerance. This change could level the playing field and help achieve greater customer satisfaction. A rationale for determining the amount of fault-tolerance based on the concept of th