Correct and Control Complex IoT Systems: Evaluation of a Classification for System Anomalies
In practice, there are deficiencies in precise interteam communication about system anomalies when performing troubleshooting and postmortem analysis across the different teams operating complex IoT systems. We evaluate the quality in use of an adaptation of IEEE Std. 1044-2009 whose objective is to differentiate the handling of fault detection and fault reaction from the handling of the underlying defect and its options for defect correction. We extended the scope of IEEE Std. 1044-2009 from anomalies related only to software to anomalies related to complex IoT systems. To evaluate the quality in use of our classification, a study was conducted at Robert Bosch GmbH. We applied our adaptation to a postmortem analysis of an IoT solution and evaluated the quality in use by conducting interviews with three stakeholders. Our adaptation was applied effectively, and both interteam communication and iterative, inductive learning for product improvement were enhanced; further training and practice are nevertheless required.
Comment: Submitted to QRS 2020 (IEEE Conference on Software Quality, Reliability and Security)
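To make the distinction concrete, the following sketch shows one possible way to encode such a classification in code: it separates the run-time record of an anomaly (what was detected and how the system reacted) from the development-time record of the underlying defect and its correction options. This is a minimal illustration with invented field and category names; it is not the classification defined in the paper or in IEEE Std. 1044-2009.

```python
# Illustrative sketch (invented names, not the paper's classification): separate the
# run-time view of an anomaly (fault detection and fault reaction) from the
# development-time view (the defect and its correction options).
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class FaultReaction(Enum):
    NONE = auto()          # anomaly observed, no automatic reaction
    RETRY = auto()         # operation retried by the system
    FAILOVER = auto()      # redundant component took over
    SAFE_STATE = auto()    # system entered a degraded/safe state

class CorrectionOption(Enum):
    SOFTWARE_FIX = auto()
    CONFIG_CHANGE = auto()
    HARDWARE_REPLACEMENT = auto()
    PROCESS_CHANGE = auto()

@dataclass
class FaultRecord:
    """Run-time handling: what was detected and how the system reacted."""
    detected_by: str            # e.g. "device watchdog", "cloud monitor"
    affected_subsystem: str     # device, connectivity, backend, ...
    reaction: FaultReaction

@dataclass
class DefectRecord:
    """Development-time handling: the underlying defect and how it may be corrected."""
    root_cause: str
    correction_options: List[CorrectionOption]
    chosen_correction: Optional[CorrectionOption] = None

@dataclass
class AnomalyReport:
    """One anomaly, recorded so that different teams refer to the same facts."""
    anomaly_id: str
    fault: FaultRecord
    defect: Optional[DefectRecord] = None   # may still be unknown during triage
```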
Combined Time and Information Redundancy for SEU-Tolerance in Energy-Efficient Real-Time Systems
Recently, the trade-off between energy consumption and fault tolerance in real-time systems has been highlighted. These works have focused on dynamic voltage scaling (DVS) to reduce dynamic energy dissipation and on time redundancy to achieve transient-fault tolerance. While the time-redundancy technique exploits the available slack time to increase fault tolerance by performing recovery executions, DVS exploits the same slack time to save energy. Therefore, we believe there is a resource conflict between the time-redundancy technique and DVS. The first aim of this paper is to propose the use of information redundancy to solve this problem. We demonstrate through analytical and experimental studies that it is possible to achieve both higher transient-fault tolerance (tolerance to single-event upsets (SEUs)) and lower energy consumption by combining information and time redundancy than by using time redundancy alone. The second aim of this paper is to analyze the interplay of transient-fault tolerance (SEU tolerance) and adaptive body biasing (ABB), which is used to reduce static leakage energy; this interplay has not been addressed in previous studies. We show that the same technique (i.e., the combination of time and information redundancy) is applicable to ABB-enabled systems and provides more advantages than time redundancy alone.
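The resource conflict described above can be illustrated with a back-of-the-envelope model: the slack before a deadline can either be reserved for a full recovery re-execution (time redundancy) or used to lower voltage and frequency (DVS), and information redundancy shrinks the reserved recovery budget so that more slack is left for voltage scaling. The sketch below uses an assumed quadratic energy model and made-up task parameters; it is not the paper's analysis.

```python
# Illustrative sketch (assumed numbers, simplified energy model): how slack time is a
# shared resource between time redundancy (recovery re-execution) and DVS.
# Dynamic energy is modeled as E ~ C * V^2 * cycles, and frequency is assumed
# proportional to voltage, so running at a scaled frequency f' = s * f_max uses
# voltage V' = s * V_max. None of the constants come from the paper.

C = 1.0            # effective switched capacitance (normalized)
V_MAX = 1.0        # nominal supply voltage (normalized)
WCET = 10e6        # worst-case execution cycles of the task
DEADLINE = 20e6    # cycles available at f_max (i.e., slack = 10e6 cycles)

def dynamic_energy(cycles, scale):
    """Energy for `cycles` of work at voltage/frequency scale `scale` (0 < scale <= 1)."""
    v = V_MAX * scale
    return C * v * v * cycles

def time_redundancy_only():
    # Reserve one full re-execution as recovery: the slack left for DVS is
    # DEADLINE - 2*WCET = 0, so the primary copy must run at full voltage.
    primary = dynamic_energy(WCET, 1.0)
    return primary  # recovery energy is spent only if a fault actually occurs

def information_plus_time_redundancy(code_overhead=0.05, recovery_fraction=0.25):
    # An error-detecting/correcting code adds a small execution overhead, but recovery
    # is assumed much shorter than a full re-execution, so more slack remains and the
    # primary copy can be slowed down.
    work = WCET * (1 + code_overhead)
    recovery = WCET * recovery_fraction
    slack_for_dvs = DEADLINE - work - recovery
    scale = work / (work + slack_for_dvs)   # slowest speed that still meets the deadline
    return dynamic_energy(work, scale)

print("time redundancy only      :", time_redundancy_only())
print("information + time redund.:", information_plus_time_redundancy())
```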
Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1
Systems for Strategic Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures, using candidate SDI weapons-to-target assignment algorithms as workloads, were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed, and capabilities that will be required for both individual tools and an integrated toolset were identified.
RescueSNN: Enabling Reliable Executions on Spiking Neural Network Accelerators under Permanent Faults
To maximize the performance and energy efficiency of Spiking Neural Network
(SNN) processing on resource-constrained embedded systems, specialized hardware
accelerators/chips are employed. However, these SNN chips may suffer from
permanent faults which can affect the functionality of weight memory and neuron
behavior, thereby causing potentially significant accuracy degradation and
system malfunctioning. Such permanent faults may come from manufacturing
defects during the fabrication process, and/or from device/transistor damages
(e.g., due to wear-out) during run-time operation. However, the impact of
permanent faults in SNN chips and the corresponding mitigation techniques have not
been thoroughly investigated yet. To address this, we propose RescueSNN, a novel
methodology to mitigate permanent faults in the compute engine of SNN chips
without requiring additional retraining, thereby significantly cutting down the
design time and retraining costs, while maintaining the throughput and quality.
The key ideas of our RescueSNN methodology are (1) analyzing the
characteristics of SNN under permanent faults; (2) leveraging this analysis to
improve the SNN fault-tolerance through effective fault-aware mapping (FAM);
and (3) devising lightweight hardware enhancements to support FAM. Our FAM
technique leverages the fault map of SNN compute engine for (i) minimizing
weight corruption when mapping weight bits on the faulty memory cells, and (ii)
selectively employing faulty neurons that do not cause significant accuracy
degradation to maintain accuracy and throughput, while considering the SNN
operations and processing dataflow. The experimental results show that our
RescueSNN improves accuracy by up to 80% while keeping the throughput reduction below 25% under high fault rates (e.g., faults at 50% of the potential fault locations), as compared to running SNNs on the faulty chip without mitigation.
Comment: Accepted for publication at Frontiers in Neuroscience, Section Neuromorphic Engineering
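As a rough illustration of what a fault-aware mapping can look like, the sketch below places each weight's most significant bits onto fault-free bit positions of a memory row, so that permanent stuck-at faults only corrupt low-order bits. The memory geometry, fault map, and placement policy are assumptions for the example; this is not RescueSNN's actual FAM technique.

```python
# Illustrative sketch (not RescueSNN's FAM): a simple fault-aware mapping that puts each
# weight's most significant bits into fault-free bit positions of a memory row, so that
# stuck-at faults corrupt only low-order bits. Sizes and the fault map are assumed.
import numpy as np

rng = np.random.default_rng(0)
BITS = 8                      # bits per weight (assumed)
ROW_WORDS = 4                 # weights per memory row (assumed)

# fault_map[row, word, bit] == True means that memory cell is permanently faulty.
fault_map = rng.random((2, ROW_WORDS, BITS)) < 0.1

def map_weight_bits(row_faults):
    """Return, per word in a row, a bit-position order that puts MSBs on healthy cells.

    row_faults: (ROW_WORDS, BITS) boolean array for one memory row.
    Returns `order` of shape (ROW_WORDS, BITS): order[w, k] is the physical bit position
    used for the k-th most significant logical bit of word w.
    """
    order = np.empty((ROW_WORDS, BITS), dtype=int)
    for w in range(ROW_WORDS):
        healthy = [b for b in range(BITS) if not row_faults[w, b]]
        faulty = [b for b in range(BITS) if row_faults[w, b]]
        # healthy positions first (they receive the MSBs), faulty positions last (LSBs)
        order[w] = healthy + faulty
    return order

for row in range(fault_map.shape[0]):
    order = map_weight_bits(fault_map[row])
    print(f"row {row}: physical bit order per word (MSB first) =\n{order}")
```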
Study of fault-tolerant software technology
Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems, and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault tolerance.
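One of the classic software fault-tolerance structures surveyed in this line of work is the recovery block: a primary routine runs first, an acceptance test checks its result, and independently written alternates are tried if the test fails. The sketch below is a generic illustration of that pattern with toy routines; it is not taken from the report.

```python
# Illustrative sketch of the recovery-block pattern: run a primary routine, check its
# result with an acceptance test, and fall back to independently written alternates if
# the test fails. The routines and acceptance test are toy examples.

def recovery_block(alternates, acceptance_test, *args):
    """Try each alternate in order until one passes the acceptance test."""
    for routine in alternates:
        try:
            result = routine(*args)
        except Exception:
            continue                     # treat an exception as a failed alternate
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

# Toy example: two independently written square-root routines.
def sqrt_newton(x, iters=20):
    guess = x if x > 1 else 1.0
    for _ in range(iters):
        guess = 0.5 * (guess + x / guess)
    return guess

def sqrt_bisection(x, iters=60):
    lo, hi = 0.0, max(1.0, x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
    return 0.5 * (lo + hi)

def close_enough(result, x):
    return abs(result * result - x) < 1e-6 * max(1.0, x)

print(recovery_block([sqrt_newton, sqrt_bisection], close_enough, 2.0))
```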
An Assessment to Benchmark the Seismic Performance of a Code-Conforming Reinforced-Concrete Moment-Frame Building
This report describes a state-of-the-art performance-based earthquake engineering methodology
that is used to assess the seismic performance of a four-story reinforced concrete (RC) office
building that is generally representative of low-rise office buildings constructed in highly seismic
regions of California. This “benchmark” building is considered to be located at a site in the Los
Angeles basin, and it was designed according to modern building codes and standards, with a ductile RC special moment-resisting frame as its seismic lateral system. The
building’s performance is quantified in terms of structural behavior up to collapse, structural and
nonstructural damage and associated repair costs, and the risk of fatalities and their associated
economic costs. To account for different building configurations that may be designed in
practice to meet requirements of building size and use, eight structural design alternatives are
used in the performance assessments.
Our performance assessments account for important sources of uncertainty in the ground
motion hazard, the structural response, structural and nonstructural damage, repair costs, and
life-safety risk. The ground motion hazard characterization employs a site-specific probabilistic
seismic hazard analysis and the evaluation of controlling seismic sources (through
disaggregation) at seven ground motion levels (encompassing return periods ranging from 7 to
2475 years). Innovative procedures for ground motion selection and scaling are used to develop
acceleration time history suites corresponding to each of the seven ground motion levels.
Structural modeling utilizes both “fiber” models and “plastic hinge” models. Structural
modeling uncertainties are investigated through comparison of these two modeling approaches,
and through variations in structural component modeling parameters (stiffness, deformation
capacity, degradation, etc.). Structural and nonstructural damage (fragility) models are based on
a combination of test data, observations from post-earthquake reconnaissance, and expert
opinion. Structural damage and repair costs are modeled for the RC beams, columns, and slab-column connections. Damage and associated repair costs are considered for some nonstructural
building components, including wallboard partitions, interior paint, exterior glazing, ceilings,
sprinkler systems, and elevators. The risk of casualties and the associated economic costs are
evaluated based on the risk of structural collapse, combined with recent models on earthquake
fatalities in collapsed buildings and accepted economic modeling guidelines for the value of
human life in loss and cost-benefit studies.
The principal results of this work pertain to the building collapse risk, damage and repair
cost, and life-safety risk. These are discussed successively as follows.
When accounting for uncertainties in structural modeling and record-to-record variability
(i.e., conditional on a specified ground shaking intensity), the structural collapse probabilities of
the various designs range from 2% to 7% for earthquake ground motions that have a 2%
probability of exceedance in 50 years (2475 years return period). When integrated with the
ground motion hazard for the southern California site, the collapse probabilities result in mean
annual frequencies of collapse in the range of [0.4 to 1.4]x10^-4 for the various benchmark
building designs. In the development of these results, we made the following observations that
are expected to be broadly applicable:
(1) The ground motions selected for performance simulations must consider spectral
shape (e.g., through use of the epsilon parameter) and should appropriately account for
correlations between motions in both horizontal directions;
(2) Lower-bound component models, which are commonly used in performance-based
assessment procedures such as FEMA 356, can significantly bias collapse analysis results; it is
more appropriate to use median component behavior, including all aspects of the component
model (strength, stiffness, deformation capacity, cyclic deterioration, etc.);
(3) Structural modeling uncertainties related to component deformation capacity and
post-peak degrading stiffness can impact the variability of calculated collapse probabilities and
mean annual rates to a similar degree as record-to-record variability of ground motions.
Therefore, including the effects of such structural modeling uncertainties significantly increases
the mean annual collapse rates. We found this increase to be roughly four to eight times relative
to rates evaluated for the median structural model;
(4) Nonlinear response analyses revealed at least six distinct collapse mechanisms, the
most common of which was a story mechanism in the third story (differing from the multi-story
mechanism predicted by nonlinear static pushover analysis);
(5) Soil-foundation-structure interaction effects did not significantly affect the structural
response, which was expected given the relatively flexible superstructure and stiff soils.
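The collapse probabilities conditioned on shaking intensity are combined with the site hazard curve to obtain the mean annual frequency of collapse reported above, essentially summing P(collapse | IM) over intensity levels weighted by the corresponding increments of the hazard curve. The sketch below performs this integration numerically with a made-up lognormal fragility and a power-law hazard curve; the values are illustrative only and are not those of the benchmark building.

```python
# Illustrative numerical sketch (made-up fragility and hazard values, not the report's):
# the mean annual frequency of collapse is obtained by integrating the collapse fragility
# P(collapse | IM) over the site hazard curve,
# lambda_c = sum_i P(C | IM_i) * |d lambda(IM_i)|.
import numpy as np
from scipy.stats import lognorm

# Assumed collapse fragility: lognormal in spectral acceleration Sa (g).
median_sa, beta = 2.0, 0.6
def p_collapse(sa):
    return lognorm.cdf(sa, s=beta, scale=median_sa)

# Assumed hazard curve: mean annual frequency of exceeding Sa, lambda(Sa) = k0 * Sa^(-k).
k0, k = 2.0e-4, 3.0
def hazard(sa):
    return k0 * sa ** (-k)

sa = np.linspace(0.05, 5.0, 500)
lam = hazard(sa)
# |d lambda| between consecutive intensity levels, times P(collapse) at the midpoint.
d_lambda = -np.diff(lam)
p_mid = p_collapse(0.5 * (sa[:-1] + sa[1:]))
lambda_collapse = np.sum(p_mid * d_lambda)
print(f"mean annual frequency of collapse ~ {lambda_collapse:.2e}")
```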
The potential for financial loss is considerable. Overall, the calculated expected annual losses (EAL) are on the order of $97,000 for the various code-conforming benchmark building designs, or roughly 1% of the replacement cost of the building. Assuming a monetary value of $3.5M per life lost, the fatality rate translates to an EAL due to fatalities of roughly $5,600 for the code-conforming designs; compared with EALs of $66,000 and above, the monetary value associated with life loss is small,
suggesting that the governing factor in this respect will be the maximum permissible life-safety
risk deemed by the public (or its representative government) to be appropriate for buildings.
Although the focus of this report is on one specific building, it can be used as a reference
for other types of structures. This report is organized in such a way that the individual core
chapters (4, 5, and 6) can be read independently. Chapter 1 provides background on the
performance-based earthquake engineering (PBEE) approach. Chapter 2 presents the
implementation of the PBEE methodology of the PEER framework, as applied to the benchmark
building. Chapter 3 sets the stage for the choices of location and basic structural design. The subsequent core chapters focus on the hazard analysis (Chapter 4), the structural analysis
(Chapter 5), and the damage and loss analyses (Chapter 6). Although the report is self-contained,
readers interested in additional details can find them in the appendices.
On the functional test of the BTB logic in pipelined and superscalar processors
Electronic systems are increasingly used for safety-critical applications, where the effects of faults must be kept under control and, wherever possible, avoided. For this purpose, the test of manufactured devices is particularly important, both at the end of the production line and during the operational phase. This paper describes a method for testing the logic implementing the Branch Prediction Unit in pipelined and superscalar processors when it follows the Branch Target Buffer (BTB) architecture; the proposed approach is functional, i.e., it is based on forcing the processor to execute a suitably devised test program and observing the produced results. Experimental results are provided on the DLX processor, showing that the method can achieve a high stuck-at fault coverage while also testing the memory in the BTB.
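To give a flavor of what a functional BTB test involves, the sketch below models a small direct-mapped BTB and a test sequence that allocates every entry, checks each prediction, then forces evictions through aliasing and checks again, folding the observed outcomes into a signature that can be compared against a golden value. The BTB size, indexing scheme, and test phases are assumptions for illustration; this is not the test program developed in the paper.

```python
# Illustrative behavioral sketch (not the paper's test program): a tiny direct-mapped BTB
# model and a stimulus that fills and then evicts every entry, comparing observed
# predictions against expected ones via a running signature, in the spirit of a
# functional/software-based self-test. Sizes and indexing are assumptions.

ENTRIES = 16          # number of BTB entries (assumed)

class BTB:
    def __init__(self):
        self.tag = [None] * ENTRIES
        self.target = [0] * ENTRIES

    def lookup(self, pc):
        idx = (pc >> 2) % ENTRIES
        if self.tag[idx] == pc:
            return self.target[idx]      # predicted branch target
        return None                      # no entry: predicted not-taken

    def update(self, pc, target):
        idx = (pc >> 2) % ENTRIES
        self.tag[idx] = pc
        self.target[idx] = target

def functional_test(btb):
    """Fill every entry, check hits, then alias each entry to force eviction and re-check."""
    signature = 0
    branches = [(0x1000 + 4 * i, 0x2000 + 4 * i) for i in range(ENTRIES)]
    aliases = [(pc + 4 * ENTRIES, tgt ^ 0xFFFF) for pc, tgt in branches]

    for pc, tgt in branches:             # phase 1: allocate all entries
        btb.update(pc, tgt)
    for pc, tgt in branches:             # phase 2: every lookup must hit the right target
        signature = (signature * 31 + (btb.lookup(pc) == tgt)) & 0xFFFFFFFF
    for pc, tgt in aliases:              # phase 3: evict via aliasing, then re-check
        btb.update(pc, tgt)
    for (pc, _), (apc, atgt) in zip(branches, aliases):
        ok = btb.lookup(pc) is None and btb.lookup(apc) == atgt
        signature = (signature * 31 + ok) & 0xFFFFFFFF
    return signature

golden = functional_test(BTB())
print("golden signature:", hex(golden))   # compared against the value observed on silicon
```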