
    Survivable algorithms and redundancy management in NASA's distributed computing systems

    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given
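
    As a rough illustration of the consensus-style redundancy this abstract refers to, the sketch below runs one task on several replicas and masks a single faulty result by majority voting. It is a minimal sketch under assumed semantics; the function names, replica count, and injected fault are illustrative and are not taken from the paper.

```python
# Minimal sketch of replicated execution with majority voting, one common
# building block of consensus-based fault tolerance. All names here are
# illustrative; the paper's actual framework and algorithms are not shown.
from collections import Counter

def run_replicated(task, inputs, replicas=3, fault=None):
    """Run `task` on `replicas` redundant nodes and vote on the results.

    `fault`, if given, is a (replica_index, corrupt_fn) pair used only to
    demonstrate that a single faulty replica is masked by the majority.
    """
    results = []
    for i in range(replicas):
        value = task(inputs)
        if fault is not None and fault[0] == i:
            value = fault[1](value)          # inject a wrong answer on one replica
        results.append(value)

    winner, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority: too many faulty replicas")
    return winner

if __name__ == "__main__":
    control_law = lambda x: round(0.5 * x + 2.0, 6)   # stand-in for an application task
    # One replica returns garbage; the vote still yields the correct output.
    print(run_replicated(control_law, 10.0, replicas=3,
                         fault=(1, lambda v: v + 99)))   # -> 7.0
```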

    Lock-V: a heterogeneous fault tolerance architecture based on Arm and RISC-V

    This article presents Lock-V, a heterogeneous fault tolerance architecture that explores a dual-core lockstep (DCLS) technique to mitigate single event upset (SEU) and common-mode failure (CMF) problems. Lock-V was deployed in two versions, Lock-VA and Lock-VM, by applying design diversity to two processor architectures at the instruction set architecture (ISA) level. Lock-VA features an Arm Cortex-A9 with a RISC-V RV64GC, while Lock-VM includes an Arm Cortex-M3 along with a RISC-V RV32IMA processor. The solution explores field-programmable gate array (FPGA) technology to deploy softcore versions of the RISC-V processors, and dedicated accelerators for performing error detection and triggering the software rollback system used for error recovery. To test Lock-V in both versions, a fault-injection mechanism was implemented to cause bit-flips in the processor registers, a problem commonly present in heavy-radiation environments. This work has been supported by FCT - Fundação para a Ciência e a Tecnologia within the R&D Units Project Scope: UIDB/00319/2020
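
    The sketch below is a software model, not the Lock-V hardware, of the dual-core lockstep loop the abstract describes: two redundant executions, a checker that compares their outputs, a rollback to the last checkpoint on mismatch, and a single injected bit-flip standing in for an SEU. The workload, state encoding, and checkpoint policy are assumptions made only for illustration.

```python
# Illustrative sketch (not the Lock-V implementation) of the basic dual-core
# lockstep idea: two diverse cores execute the same workload, a checker
# compares their outputs, and a mismatch triggers rollback to the last
# checkpoint. The bit-flip models the SEU fault-injection experiment.
import random

def workload(state):
    # Stand-in for the replicated computation step.
    return (state * 3 + 1) & 0xFFFFFFFF

def flip_random_bit(value, width=32):
    return value ^ (1 << random.randrange(width))

def lockstep_run(steps=10, seu_at=4):
    checkpoint = state_a = state_b = 7        # last known-good state
    step, injected = 0, False
    while step < steps:
        out_a = workload(state_a)
        out_b = workload(state_b)
        if step == seu_at and not injected:   # inject a single SEU into one "register"
            out_a = flip_random_bit(out_a)
            injected = True
        if out_a != out_b:                    # checker detects the divergence
            print(f"step {step}: mismatch detected, rolling back")
            state_a = state_b = checkpoint    # software rollback for recovery
            continue                          # re-execute the corrupted step
        checkpoint = state_a = state_b = out_a
        step += 1
    return checkpoint

if __name__ == "__main__":
    print("final state:", lockstep_run())
```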

    Unattended network operations technology assessment study. Technical support for defining advanced satellite systems concepts

    The results of an unattended network operations technology assessment study for the Space Exploration Initiative (SEI) are summarized. The scope of the work included: (1) identifying possible enhancements due to the proposed Mars communications network; (2) identifying network operations on Mars; (3) performing a technology assessment of possible supporting technologies based on current and future approaches to network operations; and (4) developing a plan for the testing and development of these technologies. The most important results obtained are as follows: (1) addition of a third Mars Relay Satellite (MRS) and MRS cross-link capabilities will enhance the network's fault tolerance through improved connectivity; (2) network functions can be divided into the six basic ISO network functional groups; (3) distributed artificial intelligence technologies will augment more traditional network management technologies to form the technological infrastructure of a virtually unattended network; and (4) a great effort is required to bring current network technology levels for manned space communications up to the level needed for an automated, fault-tolerant Mars communications network

    Advanced flight control system study

    The architecture, requirements, and system elements of an ultrareliable, advanced flight control system are described. The basic criteria are a functional reliability of 10^-10 per hour of flight and scheduled maintenance only every 6 months. A distributed system architecture is described, including a multiplexed communication system, reliable bus controller, the use of skewed sensor arrays, and actuator interfaces. A test bed and flight evaluation program are proposed
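
    To make the 10^-10 per flight hour target concrete, the following back-of-the-envelope calculation shows how redundancy with voting closes the gap from a plausible single-channel failure rate; the per-channel probability and channel counts are assumed for illustration and do not come from the study.

```python
# Back-of-the-envelope illustration (numbers assumed, not from the study) of
# why redundancy is needed to approach a 1e-10/hour functional failure rate:
# the probability that k or more of n independent channels fail in one hour.
from math import comb

def p_system_fail(n, k, p_channel):
    """P(at least k of n channels fail), assuming independent failures."""
    return sum(comb(n, i) * p_channel**i * (1 - p_channel)**(n - i)
               for i in range(k, n + 1))

if __name__ == "__main__":
    p = 1e-4  # assumed per-channel failure probability per flight hour
    for n, k in [(1, 1), (3, 2), (4, 3)]:
        print(f"{n} channels, fails when {k} fail: {p_system_fail(n, k, p):.2e}")
    # A single channel misses the 1e-10 target by six orders of magnitude;
    # a 4-channel, majority-voted arrangement gets to roughly 4e-12.
```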

    Real-time Prediction of Cascading Failures in Power Systems

    Blackouts in power systems cause major financial and societal losses, which necessitate devising better prediction techniques specifically tailored to detecting and preventing them. Since blackouts begin as a cascading failure (CF), early detection of these CFs gives operators ample time to stop the cascade from propagating into a large-scale blackout. In this thesis, a real-time load-based prediction model for CFs using phasor measurement units (PMUs) is proposed. The proposed model provides load-based predictions; therefore, it has the advantages of being applicable as a controller input and of providing operators with better information about the affected regions. In addition, it can aid in visualizing the effects of the CF on the grid. To extend the functionality and robustness of the proposed model, prediction intervals are incorporated based on the convergence width criterion (CWC) to allow the model to account for the uncertainties of the network, which was not available in previous works. Although this model addresses many issues in previous works, it has limitations in both scalability and the capture of transient behaviours. Hence, a second model based on a recurrent neural network (RNN) long short-term memory (LSTM) ensemble is proposed. The RNN-LSTM is added to better capture the dynamics of the power system while also giving faster responses. To address the scalability of the model, a novel selection criterion for inputs is introduced to minimize the inputs while maintaining high information entropy. The criteria include the graph-theoretic distance between buses, the centrality of the buses with respect to the fault location, and the information entropy of each bus. These criteria are merged using higher statistical moments to reflect the importance of each bus and to generate indices that describe the grid with a smaller set of inputs. The results indicate that this model has the potential to provide more meaningful and accurate results than what is available in the previous literature and can be used as part of an integrated remedial action scheme (RAS) system, either as a warning tool or as a controller input, as the accuracy of detecting affected regions reached 99.9% with a maximum delay of 400 ms. Finally, a validation-loop extension is introduced to allow the model to self-update in real time using importance sampling and case-based reasoning, extending the practicality of the model by allowing it to learn from historical data as time progresses
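
    A hedged sketch of the input-selection step described above: each bus is scored from its graph distance to the fault, a simple degree centrality, and the entropy of its load history, and only the top-scoring buses are kept as model inputs. The thesis merges the criteria using higher statistical moments; the weighted sum, the toy grid, and the load data below are placeholders.

```python
# Hedged sketch of the input-selection idea: score each bus by (i) graph
# distance to the fault, (ii) a simple centrality, and (iii) the information
# entropy of its load history, then keep the highest-scoring buses as model
# inputs. All data below is made up for illustration.
from collections import deque
from math import log2

GRID = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 5], 4: [2, 5], 5: [3, 4]}
LOAD = {1: [0.9, 0.9, 0.8], 2: [0.4, 0.9, 0.1], 3: [0.5, 0.5, 0.5],
        4: [0.2, 0.8, 0.6], 5: [0.7, 0.3, 0.9]}  # per-bus load snapshots (p.u.)

def bfs_distance(src):
    # Hop count from the faulted bus to every other bus.
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in GRID[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def entropy(samples, bins=4):
    # Shannon entropy of the quantized load history.
    counts = [0] * bins
    for s in samples:
        counts[min(int(s * bins), bins - 1)] += 1
    probs = [c / len(samples) for c in counts if c]
    return -sum(p * log2(p) for p in probs)

def rank_buses(fault_bus, keep=3):
    dist = bfs_distance(fault_bus)
    scores = {}
    for bus in GRID:
        closeness = 1.0 / (1 + dist[bus])              # nearer to fault -> higher
        centrality = len(GRID[bus]) / (len(GRID) - 1)  # degree centrality
        scores[bus] = 0.5 * closeness + 0.3 * centrality + 0.2 * entropy(LOAD[bus])
    return sorted(scores, key=scores.get, reverse=True)[:keep]

if __name__ == "__main__":
    print("buses kept as model inputs:", rank_buses(fault_bus=2))
```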

    Advanced flight control system study

    A fly-by-wire flight control system architecture designed for high reliability includes spare sensor and computer elements to permit safe dispatch with failed elements, thereby reducing unscheduled maintenance. A methodology capable of demonstrating that the architecture does achieve the predicted performance characteristics consists of a hierarchy of activities, ranging from analytical calculations of system reliability and formal methods of software verification to iron-bird testing followed by flight evaluation. Interfacing this architecture to the Lockheed S-3A aircraft for flight test is discussed. This testbed vehicle can be expanded to support flight experiments in advanced aerodynamics, electromechanical actuators, secondary power systems, flight management, new displays, and air traffic control concepts

    High-level services for networks-on-chip

    Future technology trends envision that next-generation Multiprocessor Systems-on-Chip (MPSoCs) will be composed of a large number of processing and storage elements interconnected by complex communication architectures. Communication and interconnection between these basic blocks play a role of crucial importance as the number of these elements increases. Enabling reliable communication channels between cores therefore becomes a challenge for system designers. Networks-on-Chip (NoCs) appeared as a strategy for connecting and managing the communication between several design elements and IP blocks, as required in complex Systems-on-Chip (SoCs). The topic can be considered a multidisciplinary synthesis of the multiprocessing, parallel computing, networking, and on-chip communication domains. Networks-on-Chip, in addition to standard communication services, can be employed to provide support for the implementation of system-level services. This dissertation will demonstrate how high-level services can be added to an MPSoC platform by embedding appropriate hardware/software support in the network interfaces (NIs) of the NoC. In this dissertation, the implementation of innovative modules acting in parallel with protocol translation and data transmission in NIs is proposed and evaluated. The modules can support the execution of the high-level services in the NoC at a relatively low cost in terms of area and energy consumption. Three types of services will be addressed and discussed: security, monitoring, and fault tolerance. With respect to the security aspect, this dissertation will discuss the implementation of an innovative data protection mechanism for detecting and preventing illegal accesses to protected memory blocks and/or memory-mapped peripherals. The second aspect will be addressed by proposing the implementation of a monitoring system based on programmable multipurpose monitoring probes aimed at detecting NoC internal events and run-time characteristics. Finally, new architectural solutions for the design of fault-tolerant network interfaces will be presented and discussed
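
    As an illustration of the data-protection service mentioned above, the sketch below models a network-interface firewall that checks each transaction against an access-rights table before forwarding it to a protected memory block or memory-mapped peripheral. The table layout, rule fields, and addresses are assumptions, not the dissertation's actual design.

```python
# Illustrative sketch (names and table layout assumed) of a data-protection
# service placed in a network interface: each incoming request is checked
# against an access-rights table before it may reach a protected memory
# block or memory-mapped peripheral.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    base: int            # start of the protected address range
    size: int            # range length in bytes
    allowed: frozenset   # initiator IDs allowed to access the range
    writable: bool       # whether writes are permitted at all

RULES = [
    Rule(0x4000_0000, 0x1000, frozenset({0, 1}), writable=True),   # shared buffer
    Rule(0x4000_1000, 0x0100, frozenset({0}),    writable=False),  # read-only region
]

def check_access(initiator_id, addr, is_write):
    """Return True if the NI should forward the transaction, False to drop it."""
    for rule in RULES:
        if rule.base <= addr < rule.base + rule.size:
            if initiator_id not in rule.allowed:
                return False                 # illegal initiator
            if is_write and not rule.writable:
                return False                 # illegal write to read-only range
            return True
    return True  # addresses outside any protected range pass through unchecked

if __name__ == "__main__":
    print(check_access(1, 0x4000_0008, is_write=True))   # True: allowed initiator
    print(check_access(2, 0x4000_0008, is_write=True))   # False: blocked initiator
    print(check_access(0, 0x4000_1004, is_write=True))   # False: read-only range
```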

    Scan Test Coverage Improvement via Automatic Test Pattern Generation (ATPG) Tool Configuration

    Scan test coverage improvement through automatic test pattern generation (ATPG) tool configuration was investigated. Improving test coverage is essential for detecting manufacturing defects in the semiconductor industry so that high-quality products can be supplied to consumers. The ATPG tool used was Mentor Graphics Tessent TestKompress (version 2014.1). The study was carried out by setting up a few experiments utilizing and modifying ATPG commands and switches, observing the test coverage improvement from the statistical reports provided during the pattern generation process, and providing related discussion. By modifying the ATPG commands, some improvement in test coverage can be expected. The scan test patterns generated were stuck-at test patterns. Based on the experiments, the different coverage readings were compared and the most optimized ATPG method and flow were determined. The most optimized flow gave an improvement of 0.91% in test coverage, which is acceptable since this method does not involve a change in design. The test patterns generated were converted and run on automatic test equipment (ATE) to observe their performance on real silicon. Improving test coverage with the ATPG tool instead of the design-based method is important as a faster workaround for back-end engineers to provide high-quality test content within a short product development duration
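
    The small calculation below illustrates how a test-coverage percentage is formed and how a gain on the order of 0.91% can arise, using the common definition of coverage as detected faults over testable faults; all fault counts are invented and are not results from this study.

```python
# Simplified illustration of a test-coverage figure and of how a small gain
# like 0.91% arises. The formula follows the commonly used definition
# (detected faults over testable faults); all counts below are invented.
def test_coverage(detected, total_faults, untestable):
    testable = total_faults - untestable
    return 100.0 * detected / testable

if __name__ == "__main__":
    total, untestable = 1_000_000, 20_000
    baseline = test_coverage(detected=955_000, total_faults=total, untestable=untestable)
    tuned    = test_coverage(detected=963_918, total_faults=total, untestable=untestable)
    print(f"baseline ATPG run : {baseline:.2f}%")
    print(f"tuned ATPG run    : {tuned:.2f}%")
    print(f"improvement       : {tuned - baseline:.2f}%")
```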

    Energy-Efficient and Reliable Computing in Dark Silicon Era

    Dark silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency is decreasing with each technology generation. Moore's law and Dennard scaling were coupled appropriately for five decades to bring commensurate exponential performance via single-core and later multi-core designs. However, recalculating Dennard scaling for recent small technology sizes shows that the ongoing multi-core growth demands exponentially increasing thermal design power to achieve a linear performance increase. This process hits a power wall, which raises the amount of dark or dim silicon on future multi-/many-core chips more and more. Furthermore, with the increasing number of transistors on a single chip and the susceptibility to internal defects and aging phenomena, which are exacerbated by high chip thermal density, monitoring and managing chip reliability before and after its activation is becoming a necessity. The proposed approaches and experimental investigations in this thesis focus on two main tracks: 1) power awareness and 2) reliability awareness in the dark silicon era; later, these two tracks are combined. In the first track, the main goal is to increase the returns in terms of the most important features in chip design, such as performance and throughput, while the maximum power limit is honored. In fact, we show that by managing power while having dark silicon, all the traditional benefits that could be achieved by proceeding along Moore's law can also be achieved in the dark silicon era, although to a lesser extent. Via the track of reliability awareness in the dark silicon era, we show that dark silicon can be considered an opportunity to be exploited for different benefits, namely lifetime increase and online testing. We discuss how dark silicon can be exploited to guarantee that the system lifetime stays above a certain target value and, furthermore, how dark silicon can be exploited to apply low-cost, non-intrusive online testing on the cores. After the demonstration of power and reliability awareness while having dark silicon, two approaches are discussed as case studies in which power and reliability awareness are combined. The first approach demonstrates how chip reliability can be used as a supplementary metric for power-reliability management, while the second approach provides a trade-off between workload performance and system reliability by simultaneously honoring the given power budget and target reliability
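
    As a conceptual sketch of combining power and reliability awareness, the code below activates only as many cores as an assumed power budget allows and prefers the least-aged cores, leaving the rest dark; the core data, cost model, and greedy policy are illustrative and are not the thesis' algorithms.

```python
# Conceptual sketch (not the thesis' algorithms) of the combined power- and
# reliability-aware idea: activate only as many cores as the power budget
# allows, and prefer the least-aged cores so that wear is spread and a
# lifetime target is easier to meet. Core data and the cost model are made up.
from dataclasses import dataclass

@dataclass
class Core:
    cid: int
    power_w: float   # power drawn when the core is switched on
    aging: float     # accumulated wear indicator, 0.0 (new) .. 1.0 (worn out)

def pick_active_cores(cores, power_budget_w):
    """Greedy mapping: least-aged cores first, stop at the power budget.

    Cores left unselected stay dark this epoch, which both respects the
    budget and lets them cool down and age more slowly.
    """
    active, used = [], 0.0
    for core in sorted(cores, key=lambda c: c.aging):
        if used + core.power_w <= power_budget_w:
            active.append(core.cid)
            used += core.power_w
    return active, used

if __name__ == "__main__":
    chip = [Core(0, 2.0, 0.30), Core(1, 2.0, 0.10), Core(2, 2.5, 0.80),
            Core(3, 2.0, 0.05), Core(4, 2.5, 0.50), Core(5, 2.0, 0.20)]
    active, used = pick_active_cores(chip, power_budget_w=8.0)
    print(f"active cores: {active}, power used: {used:.1f} W of 8.0 W")
```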