Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview
Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprising conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first-order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation.
On Fault Tolerance Methods for Networks-on-Chip
Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit.
This thesis studies and presents fault tolerance methods for the network-on-chip (NoC), a design paradigm targeted at very large systems-on-chip. In a NoC, resources such as processors and memories are connected to a communication network, comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels.
The thesis studies the origin of faults in modern technologies and explains their classification into transient, intermittent, and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, a routing protocol, and a flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model.
The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method at this abstraction level, especially against transient faults. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults; therefore, other solutions against them are presented. The introduction of spare wires and split transmissions is shown to provide good tolerance against intermittent and permanent errors, and their combination with error control coding is illustrated.
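To make the data-link-layer approach concrete, the sketch below implements a single-error-correcting Hamming(7,4) code in Python, a representative example of the kind of error control coding surveyed for on-chip links; the 4-bit flit payload and the function names are illustrative assumptions, not the thesis's actual codec.

# Hedged sketch: Hamming(7,4) single-error correction for a 4-bit flit
# (illustrative flit width, not the codec used in the thesis).
def hamming74_encode(data_bits):
    """Encode 4 data bits (list of 0/1) into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4              # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code_bits):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code_bits)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 0 means no detected error
    if syndrome:
        c[syndrome - 1] ^= 1           # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]

# A transient fault flips one wire of the link; the receiver still recovers the flit.
sent = hamming74_encode([1, 0, 1, 1])
received = sent.copy()
received[5] ^= 1
assert hamming74_decode(received) == [1, 0, 1, 1]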
At the network layer, positioned above the data link layer, fault tolerance can be achieved through the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis, together with realizations in both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levels.
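For the network layer, the following minimal sketch shows the idea of routing around failures in a 2D-mesh NoC: dimension-order (XY) routing that detours when the preferred link is marked faulty. The routing function and fault model are illustrative assumptions, and the sketch deliberately ignores deadlock avoidance, which any real fault-tolerant routing algorithm must also guarantee.

# Hedged sketch: fault-aware XY routing in a 2D mesh (illustration only;
# deadlock freedom and fault diagnosis are out of scope here).
def next_hop(cur, dst, faulty_links):
    """Return the next router on the way from cur to dst, avoiding faulty links.

    cur, dst: (x, y) router coordinates; faulty_links: set of frozenset pairs.
    """
    x, y = cur
    candidates = []
    if dst[0] != x:
        candidates.append((x + (1 if dst[0] > x else -1), y))   # preferred X move
    if dst[1] != y:
        candidates.append((x, y + (1 if dst[1] > y else -1)))   # fallback Y move
    for nxt in candidates:
        if frozenset((cur, nxt)) not in faulty_links:
            return nxt
    return None   # no productive fault-free link; a real router would escalate

# Example: the link (1,0)-(2,0) is broken, so a packet to (2,1) detours through (1,1).
faults = {frozenset(((1, 0), (2, 0)))}
hop = (0, 0)
path = [hop]
while hop != (2, 1):
    hop = next_hop(hop, (2, 1), faults)
    path.append(hop)
print(path)   # [(0, 0), (1, 0), (1, 1), (2, 1)]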
The "MIND" Scalable PIM Architecture
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND architecture.
SkyMapper Southern Survey: First Data Release (DR1)
We present the first data release (DR1) of the SkyMapper Southern Survey, a hemispheric survey carried out with the SkyMapper Telescope at Siding Spring Observatory in Australia. Here, we present the survey strategy, data processing, catalogue construction, and database schema. The DR1 dataset includes over 66,000 images from the Shallow Survey component, covering an area of 17,200 deg² in all six SkyMapper passbands, while the full area covered by any passband exceeds 20,000 deg². The catalogues contain over 285 million unique astrophysical objects, complete to roughly 18 mag in all bands. We compare our point-source photometry with Pan-STARRS1 DR1 and note an RMS scatter of 2%. The internal reproducibility of SkyMapper photometry is on the order of 1%. Astrometric precision is better than 0.2 arcsec based on comparison with Gaia DR1. We describe the end-user database, through which data are presented to the world community, and provide some illustrative science queries.
Comment: 31 pages, 19 figures, 10 tables, PASA, accepted.
Hardware Error Detection Using AN-Codes
Due to the continuously decreasing feature sizes and the increasing complexity of integrated circuits, commercial off-the-shelf (COTS) hardware is becoming less and less reliable. However, dedicated reliable hardware is expensive and usually slower than commodity hardware. Thus, economic pressure will most likely result in the usage of unreliable COTS hardware in safety-critical systems.
The use of unreliable COTS hardware in safety-critical systems results in the need for software-implemented solutions for handling execution errors caused by this unreliable hardware. In this thesis, we provide techniques for detecting hardware errors that disturb the execution of a program. The detection provided facilitates handling of these errors, for example, by retry or graceful degradation.
We realize the error detection by transforming unsafe programs, which are not guaranteed to detect execution errors, into safe programs that detect execution errors with a high probability. To this end, we use arithmetic AN-, ANB-, ANBD-, and ANBDmem-codes. These codes detect errors that modify data during storage or transport, as well as errors that disturb computations. Furthermore, the error detection provided is independent of the hardware used.
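As a minimal illustration of the plain AN part of these codes, the sketch below encodes an integer x as A * x, checks code membership by divisibility, and shows that addition stays within the code. The constant A is an illustrative choice, and the B and D parts (signatures and timestamps) of ANB/ANBD codes are omitted.

# Hedged sketch of an AN-code: every valid encoded value is a multiple of A,
# so a corrupted value is very likely to fail the divisibility check.
A = 58659                      # illustrative code constant (assumption)

def encode(x):
    return A * x

def decode(xc):
    if xc % A != 0:            # nonzero residue signals a detected hardware error
        raise RuntimeError("hardware error detected: invalid code word")
    return xc // A

def encoded_add(xc, yc):
    # Addition commutes with the encoding: A*x + A*y == A*(x + y),
    # so a fault in an operand or the result usually breaks divisibility by A.
    return xc + yc

xc, yc = encode(20), encode(22)
assert decode(encoded_add(xc, yc)) == 42

corrupted = encoded_add(xc, yc) ^ (1 << 7)   # simulate a bit flip in the result
try:
    decode(corrupted)
except RuntimeError as e:
    print(e)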
We present the following novel encoding approaches:
- Software Encoded Processing (SEP) that transforms an unsafe binary into a safe execution at runtime by applying an ANB-code, and
- Compiler Encoded Processing (CEP) that applies encoding at compile time and provides different levels of safety by using different arithmetic codes.
In contrast to existing encoding solutions, SEP and CEP make it possible to encode applications whose data and control flow are not completely predictable at compile time.
For encoding, SEP and CEP use our set of encoded operations, also presented in this thesis. To the best of our knowledge, we are the first to present the encoding of a complete RISC instruction set, including Boolean and bitwise logical operations, casts, unaligned loads and stores, shifts, and arithmetic operations.
Our evaluations show that encoding with SEP and CEP significantly reduces the amount of erroneous output caused by hardware errors. Furthermore, our evaluations show that, in contrast to replication-based approaches for detecting errors, arithmetic encoding facilitates the detection of permanent hardware errors.
This increased reliability does not come for free. However, unexpectedly, the runtime costs of the different arithmetic codes supported by CEP grow only linearly relative to redundancy, while the safety gained increases exponentially.
Cross-layer reliability evaluation, moving from the hardware architecture to the system level: A CLERECO EU project overview
Advanced computing systems realized in forthcoming technologies hold the promise of a significant increase of computational capabilities. However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable. Developing new methods to evaluate the reliability of these systems in an early design stage has the potential to save costs, produce optimized designs and have a positive impact on the product time-to-market.
The CLERECO European FP7 research project addresses early reliability evaluation with a cross-layer approach that spans different computing disciplines, computing system layers, and computing market segments. The fundamental objective of the project is to investigate in depth a methodology to assess system reliability early in the design cycle of the future systems of the emerging computing continuum. This paper presents a general overview of the CLERECO project, focusing on the main tools and models being developed that could be of interest to the research community and to engineering practice.
Design, Analysis and Test of Logic Circuits under Uncertainty.
Integrated circuits are increasingly susceptible to uncertainty caused by soft errors, inherently probabilistic devices, and manufacturing variability. As device technologies scale, these effects become detrimental to circuit reliability. In order to address this, we develop methods for analyzing, designing, and testing circuits subject to probabilistic effects. Our main contributions are: 1) a fast soft-error rate (SER) analyzer that uses functional-simulation signatures to capture error effects, 2) novel design techniques that improve reliability with little area and performance overhead, 3) a matrix-based reliability-analysis framework that captures many types of probabilistic faults, and 4) test-generation/compaction methods aimed at probabilistic faults in logic circuits.
SER analysis must account for the main error-masking mechanisms in ICs: logic, timing, and electrical masking. We relate logic masking to node testability of the circuit and utilize functional-simulation signatures, i.e., partial truth tables, to efficiently compute testability (signal probability and observability). To account for timing masking, we compute error-latching windows (ELWs) from timing analysis information. Electrical masking is incorporated into our estimates through derating factors for gate error probabilities. The SER of a circuit is computed by combining the effects of all three masking mechanisms within our SER analyzer called AnSER.
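The following sketch illustrates (without reproducing AnSER itself) how bit-parallel functional-simulation signatures can yield such estimates: each signal's signature packs its value under many random input vectors into one large integer, from which signal probability and an observability estimate are read off by bit counting. The two-gate circuit and the vector count are assumptions made only to keep the example self-contained.

# Hedged sketch: signature-based signal probability and observability estimation
# for a toy circuit y = (a AND b) OR c (illustrative circuit, not AnSER).
import random

N = 1024                                   # number of random input vectors
mask = (1 << N) - 1

def rand_sig():
    return random.getrandbits(N)           # one bit per simulated input vector

def simulate(a, b, c, flip=None):
    """Evaluate y = (a AND b) OR c bit-parallel, optionally forcing node g to its complement."""
    g = a & b
    if flip == "g":
        g ^= mask
    return g | c

a, b, c = rand_sig(), rand_sig(), rand_sig()

# Signal probability of internal node g: fraction of vectors where it evaluates to 1.
g_sig = a & b
prob_g = bin(g_sig).count("1") / N

# Observability of g at the output: fraction of vectors where complementing g flips y.
diff = simulate(a, b, c) ^ simulate(a, b, c, flip="g")
obs_g = bin(diff).count("1") / N

print(f"P(g=1) ~ {prob_g:.2f}, observability(g) ~ {obs_g:.2f}")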
Using AnSER, we develop several low-overhead techniques that increase reliability, including: 1) an SER-aware design method that uses redundancy already present within the circuit, 2) a technique that resynthesizes small logic windows to improve area and reliability, and 3) a post-placement gate-relocation technique that increases timing masking by decreasing ELWs.
We develop the probabilistic transfer matrix (PTM) modeling framework to analyze effects beyond soft errors. PTMs are compressed into algebraic decision diagrams (ADDs) to improve computational efficiency. Several ADD algorithms are developed to extract reliability and error-susceptibility information from PTMs representing circuits.
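The sketch below illustrates the PTM idea with dense numpy matrices instead of ADD compression: each gate is a matrix of output probabilities conditioned on its inputs, side-by-side gates compose with the Kronecker product, and cascaded stages compose with matrix multiplication. The three-NAND circuit and the error probability are illustrative assumptions.

# Hedged sketch: dense PTMs for NAND(NAND(a,b), NAND(c,d)) (illustration only;
# the thesis represents PTMs compactly with ADDs rather than dense matrices).
import numpy as np

def nand_ptm(p_err):
    """4x2 PTM of a 2-input NAND that inverts its output with probability p_err."""
    ideal = np.array([[0.0, 1.0],     # inputs 00 -> output 1
                      [0.0, 1.0],     # 01 -> 1
                      [0.0, 1.0],     # 10 -> 1
                      [1.0, 0.0]])    # 11 -> 0
    flip = np.array([[1.0 - p_err, p_err],
                     [p_err, 1.0 - p_err]])
    return ideal @ flip               # ideal gate followed by its error behaviour

p = 0.01
stage1 = np.kron(nand_ptm(p), nand_ptm(p))   # two NANDs side by side: 4 inputs -> 2 wires
stage2 = nand_ptm(p)                          # their outputs feed a third NAND
circuit_ptm = stage1 @ stage2                 # 16x2 matrix: P(final output | 4 primary inputs)

# Reliability for one input assignment: inputs 1111 should yield output 1 fault-free,
# so this entry is the probability the faulty circuit still produces the correct value.
print(circuit_ptm[0b1111, 1])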
We propose new algorithms for circuit testing under probabilistic faults, which require a reformulation of existing test techniques. For instance, a test vector may need to be repeated many times to detect a fault, and different vectors detect the same fault with different probabilities. We develop test generation methods that account for these differences, and integer linear programming (ILP) formulations to optimize test sets.
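As a small numeric illustration of the repetition issue mentioned above (and not of the thesis's ILP-based optimization), the sketch below computes how many times a vector that detects a probabilistic fault with per-application probability p must be applied to reach a target detection confidence.

# Hedged sketch: repetitions needed so that 1 - (1 - p)^n reaches the target confidence.
import math

def repetitions_needed(p_detect, target_confidence=0.99):
    """Smallest n with 1 - (1 - p_detect)**n >= target_confidence."""
    return math.ceil(math.log(1.0 - target_confidence) / math.log(1.0 - p_detect))

for p in (0.9, 0.5, 0.1):
    print(f"p = {p}: apply the vector {repetitions_needed(p)} times for 99% confidence")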