
    Innovative Techniques for Testing and Diagnosing SoCs

    We rely on the continued functioning of many electronic devices for our everyday welfare, and these usually embed integrated circuits that keep getting cheaper and smaller while gaining features. Nowadays, microelectronics can integrate a working computer with CPU, memories, and even GPUs on a single die, namely a System-on-Chip (SoC). SoCs are also employed in safety-critical automotive applications, but they need to be tested thoroughly to comply with reliability standards, in particular ISO 26262 on functional safety for road vehicles. The goal of this Ph.D. thesis is to improve SoC reliability by proposing innovative techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals, and GPUs. The proposed approaches, in the order they appear in this thesis, are as follows: 1. Embedded Memory Diagnosis: Memories are dense, complex circuits that are susceptible to design and manufacturing errors, so it is important to understand how faults occur in the memory array. In practice, the logical and the physical array representations differ because of design optimizations known as scrambling. This part proposes an accurate memory-diagnosis flow built around a software tool that analyzes test results, unscrambles the memory array, maps failing syndromes to physical cell locations, performs cumulative analysis, and produces a final fault-model hypothesis. Several SRAM failing syndromes, gathered on an industrial automotive 32-bit SoC developed by STMicroelectronics, were analyzed as case studies. The tool located the defects virtually, and the results were confirmed by microscope photographs. 
2. Functional Test Pattern Generation: The key to a successful test is the pattern applied to the device. Patterns can be structural or functional; the former usually rely on embedded test modules targeting manufacturing defects and are effective only before the component is shipped to the client. Functional patterns, on the other hand, can be applied in mission mode with minimal impact on performance, but they suffer from long generation times. Moreover, functional test patterns can serve different goals in mission mode. Part III of this Ph.D. thesis proposes three functional test-pattern generation methods for CPU cores embedded in SoCs, each targeting a different test purpose: a. Functional Stress Patterns: suitable for maximizing functional stress during Operational-Life Tests and Burn-In screening, for optimal device-reliability characterization; b. Functional Power-Hungry Patterns: suitable for determining the functional peak power, used to strictly limit the power of structural patterns during manufacturing tests, thus reducing premature device over-kill while preserving high test coverage; c. Software-Based Self-Test (SBST) Patterns: combine the potential of structural patterns with that of functional ones, allowing periodic execution during mission. In addition, an external hardware module communicating with the devised SBST was proposed; it increases fault coverage by 3% by testing critical Hardly Functionally Testable Faults not covered by conventional SBST patterns. An automatic functional test-pattern generator based on an evolutionary algorithm that maximizes metrics related to stress, power, and fault coverage was employed in all the above approaches to generate the desired patterns quickly. The approaches were evaluated on two industrial cases developed by STMicroelectronics: an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation time was reduced by up to 75% compared with earlier methodologies while significantly improving the target metrics. 
3. Fault Injection in GPGPUs: Fault-injection mechanisms in semiconductor devices are useful for generating structural patterns, testing and activating mitigation techniques, and validating robust hardware and software applications. GPGPUs provide fast parallel computation for high-performance computing and advanced driver assistance, where reliability is the key concern. However, GPGPU manufacturers do not release design-description code, to protect intellectual property, so commercial fault injectors that rely on a GPGPU model are infeasible, leaving costly radiation tests as the only alternative. In the last part of this thesis, we propose a software-implemented fault injector able to inject bit-flips into memory elements of a real GPGPU. It exploits a software debugger tool and the C-CUDA grammar to wisely choose fault sites and apply bit-flip operations to program variables. The goal is to validate robust parallel algorithms by studying fault propagation or by activating any redundancy mechanisms they embed. The effectiveness of the tool was evaluated on two robust applications: redundant parallel matrix multiplication and a floating-point Fast Fourier Transform
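The core mechanic of such a software-implemented fault injector, flipping one bit of a variable mid-computation and watching whether redundancy catches it, can be sketched in a few lines. This is a toy Python illustration rather than the thesis's debugger-based C-CUDA tool; all names here are invented:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 32-bit float, mimicking a memory-element upset."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

def redundant_dot(a, b, inject_at=None, bit=30):
    """Duplicate-and-compare dot product: the two copies agree unless a
    fault was injected into one of them (inject_at = (copy_id, index))."""
    def run(copy_id):
        acc = 0.0
        for i, (x, y) in enumerate(zip(a, b)):
            p = x * y
            if inject_at == (copy_id, i):   # fault site chosen by the injector
                p = flip_bit(p, bit)
            acc += p
        return acc
    r0, r1 = run(0), run(1)
    return r0, r1, r0 == r1

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(redundant_dot(a, b))                    # golden run: copies agree
print(redundant_dot(a, b, inject_at=(1, 2)))  # fault in copy 1: mismatch detected
```

A real injector would set the breakpoint and rewrite the variable via the debugger instead of instrumenting the source, but the observable effect on a redundant algorithm is the same.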

    Fault-tolerant computer study

    A set of building-block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building-block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communication buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology that led to the definition of the building-block circuits are discussed
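The abstract's two ideas, self-checking modules and a redundant network that routes around failed ones, can be illustrated with a minimal sketch (illustrative only; the real building blocks are hardware, not Python, and all names are invented):

```python
from collections import Counter

def self_check(compute, x):
    """A self-checking module runs duplicated logic and compares results,
    signalling a fault instead of emitting a wrong answer."""
    a, b = compute(x), compute(x)
    return a if a == b else None  # None = "I have failed, route around me"

def redundant_network(modules, x):
    """Backup modules circumvent failures: take the majority of the answers
    from modules that pass their own self-check."""
    answers = [r for m in modules if (r := self_check(m, x)) is not None]
    value, votes = Counter(answers).most_common(1)[0]
    return value

ok = lambda x: x * x       # healthy module
stuck = lambda x: 42       # consistently wrong module: passes its self-check,
                           # but is outvoted by the redundant network
print(redundant_network([ok, stuck, ok], 5))  # → 25
```

Note the division of labour: self-checking catches a module that disagrees with itself, while network-level redundancy catches one that is consistently wrong.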

    Behind the Last Line of Defense -- Surviving SoC Faults and Intrusions

    Today, leveraging the enormous modular power, diversity, and flexibility of manycore systems-on-a-chip (SoCs) requires careful orchestration of complex resources, a task left to low-level software, e.g. hypervisors. In current architectures, this software forms a single point of failure and a worthwhile target for attacks: once it is compromised, adversaries gain access to all information and full control over the platform and the environment it controls. This paper proposes Midir, an enhanced manycore architecture that effects a paradigm shift from SoCs to distributed SoCs. Midir changes the way platform resources are controlled: it retrofits tile-based fault containment through well-known mechanisms, while securing low-overhead quorum-based consensus on all critical operations, in particular privilege management and, thus, the management of containment domains. By allowing versatile redundancy management, Midir promotes resilience at all software levels, including the lowest. We explain this architecture, its associated algorithms and hardware mechanisms, and show, for the example of a Byzantine fault-tolerant microhypervisor, that it outperforms the highly efficient MinBFT by one order of magnitude
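The quorum idea behind this style of privilege management can be sketched as follows. This is a loose illustration under our own assumptions, not Midir's actual hardware interface:

```python
from collections import Counter

def quorum_write(votes, f):
    """A privileged operation is applied only if at least f + 1 distinct
    tiles request the same (register, value) pair, so a single compromised
    tile cannot reconfigure the platform alone. Names are illustrative."""
    tally = Counter(votes.values())       # votes: tile_id -> (register, value)
    op, count = tally.most_common(1)[0]
    return op if count >= f + 1 else None # None = request rejected

# Three tiles, tolerating f = 1 faulty tile:
votes = {"tile0": ("PRIV_CTRL", 1),
         "tile1": ("PRIV_CTRL", 1),
         "tile2": ("PRIV_CTRL", 9)}       # rogue request from a compromised tile
print(quorum_write(votes, f=1))           # the rogue request is outvoted
```

In Midir this check is a small hardware mechanism in front of the privileged registers; the point of the sketch is only the f + 1 agreement rule.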

    Fault Injection and Monitoring Capability for a Fault-Tolerant Distributed Computation System

    The Configurable Fault-Injection and Monitoring System (CFIMS) is intended for the experimental characterization of effects caused by a variety of adverse conditions on a distributed computation system running flight control applications. A product of research collaboration between NASA Langley Research Center and Old Dominion University, the CFIMS is the main research tool for generating actual fault response data with which to develop and validate analytical performance models and design methodologies for the mitigation of fault effects in distributed flight control systems. Rather than a fixed design solution, the CFIMS is a flexible system that enables the systematic exploration of the problem space and can be adapted to meet the evolving needs of the research. The CFIMS has the capabilities of system-under-test (SUT) functional stimulus generation, fault injection and state monitoring, all of which are supported by a configuration capability for setting up the system as desired for a particular experiment. This report summarizes the work accomplished so far in the development of the CFIMS concept and documents the first design realization
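An injection-and-monitoring experiment loop of the kind such a system automates might look like this in miniature (purely illustrative; the function names and the toy actuator are our assumptions, not CFIMS internals):

```python
def run_experiment(sut, stimuli, faults, monitor):
    """Drive the system under test with a stimulus sequence, optionally
    substitute a configured fault at a given step, and record the monitored
    state for later analysis."""
    log = []
    for step, x in enumerate(stimuli):
        fault = faults.get(step)             # configuration: step -> faulty input
        y = sut(fault if fault is not None else x)
        log.append((step, x, fault, monitor(y)))
    return log

sut = lambda cmd: max(-1.0, min(1.0, cmd))   # toy actuator with saturation
monitor = lambda y: "SAT" if abs(y) == 1.0 else "OK"
log = run_experiment(sut, [0.1, 0.2, 0.3], {1: 5.0}, monitor)
print(log)  # step 1 carries the injected fault and trips the monitor
```

The configuration capability described in the report corresponds to the `stimuli` and `faults` inputs here: the same loop serves many experiments by swapping configurations rather than redesigning the harness.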

    Resilience of an embedded architecture using hardware redundancy

    In the last decade the dominance of the general-purpose computing market has been overtaken by embedded systems, with billions of units manufactured every year. Embedded systems appear in contexts where continuous operation is of utmost importance and the consequences of failure can be profound. Nowadays, radiation poses a serious threat to the reliable operation of safety-critical systems. Fault-avoidance techniques, such as radiation hardening, have commonly been used in space applications. However, these components are expensive, lag behind commercial components in performance, and do not eliminate faults entirely. Without fault-tolerance mechanisms, many of these faults can become errors at the application or system level, which in turn can result in catastrophic failures. In this work we study the concepts of fault tolerance and dependability and extend them, providing our own definition of resilience. We analyse the physics of radiation-induced faults, the damage mechanisms of particles, and the process that leads to computing failures. We provide extensive taxonomies of 1) existing fault-tolerance techniques and 2) the effects of radiation on state-of-the-art electronics, analysing and comparing their characteristics. We propose a detailed fault model and a classification of the different types of faults at various levels. We introduce a fault-tolerance algorithm and define the system states and actions necessary to implement it. We introduce novel hardware and system-software techniques that provide a more efficient combination of reliability, performance, and power consumption than existing techniques. We propose a new system element, called the syndrome, that is the core of a resilient architecture whose software and hardware can adapt to both reliable and unreliable environments. We implement a software simulator and disassembler and introduce a testing framework in combination with ERA’s assembler and commercial hardware simulators
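The adaptive role the abstract assigns to the syndrome, switching between performance and reliability modes as the environment changes, can be sketched like this (an invented miniature under our own assumptions, not the thesis's actual design):

```python
class Syndrome:
    """Record detected errors and adapt the protection level, paying the
    cost of redundancy only when the environment demands it."""
    def __init__(self, threshold=3, window=100):
        self.errors, self.ops = 0, 0
        self.threshold, self.window = threshold, window
        self.mode = "performance"            # no redundancy in benign environments

    def record(self, error: bool):
        self.ops += 1
        self.errors += error
        if self.ops >= self.window:          # end of the observation window
            rate_high = self.errors >= self.threshold
            self.mode = "reliability" if rate_high else "performance"
            self.errors = self.ops = 0       # start a fresh window

s = Syndrome()
for i in range(100):
    s.record(error=(i % 25 == 0))            # 4 errors observed in the window
print(s.mode)                                # environment looks hostile: switch modes
```

The thesis's syndrome also carries hardware state; the sketch captures only the feedback loop between observed error rate and protection level.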

    Acta Cybernetica : Volume 16. Number 2.


    Come On. I Need An Answer. A Mixed-Methods Study Of Barriers And Disparities In Diagnostic Odysseys

    Background: The boom of next-generation DNA sequencing over the past decade has improved our ability to provide accurate genetic diagnoses for children with previously undiagnosed diseases, in turn leading to important advances in management and prognostication. Even given this progress, two areas of ongoing need are the accurate definition of further novel genetic diseases and the broad availability of genetic expertise and diagnostics to children and families who have frequently endured grueling diagnostic odysseys. The Pediatric Genomics Discovery Program (PGDP) at Yale is an advanced genomics program focusing on both areas; it has enrolled over 700 patients since its inception and eventually provided approximately one-third of them with new genetic diagnoses. Despite this success, we questioned whether the PGDP was achieving its full potential for impact by reaching a broad, representative participant population. Hypothesis: Current PGDP participant demographics are not representative of the racial/ethnic and socioeconomic diversity in the community of patients with potentially undiagnosed genetic diseases, which may relate to systemic barriers along the diagnostic odyssey. Methods: We created a questionnaire and an in-depth interview process for existing PGDP participants to evaluate barriers to diagnostic care, then analyzed transcripts for themes. We analyzed demographic characteristics and referral routes of the PGDP cohort to find factors related to recruitment. We developed a screening tool based on diagnostic codes and queried the Yale New Haven Health System (YNHHS) electronic health record (EHR) to identify children hospitalized between 2017 and 2022 with potentially undiagnosed genetic conditions, estimate their prevalence, and compare their characteristics with those of patients already enrolled in the PGDP. We then manually reviewed patient charts to further narrow the group to those who likely had undiagnosed genetic diseases. 
We used Pearson chi-square tests for categorical data, a multinomial regression model for predictors of enrollment, and Kruskal-Wallis one-way analysis of variance with pairwise comparisons and Bonferroni correction for multiple comparisons. Results: Survey results identified 1) not knowing the PGDP existed (42%) and 2) not knowing whether they qualified for the PGDP (36%) as the most common barriers to participant enrollment. Qualitative interviews identified three overarching themes related to the search for a unifying medical diagnosis for patients and families: 1) challenges along the diagnostic odyssey (largely barriers in the healthcare system), 2) tools to navigate the uncertainty (particularly a parent serving as a care-captain), and 3) perceptions of the PGDP (reservations about participating vs. the desire for a diagnosis). In the PGDP cohort analysis, being directly identified by a PGDP-affiliated physician was associated with the highest representation of URM patients (52%), compared with referrals through Yale Genetics (27%) or other referrals (16%), and with significantly greater URM representation compared with both the national pediatric population (p=0.008) and a peer genetics program (

    Applying CBM and PHM concepts with reliability approach for Blowout Preventer (BOP): a literature review

    The sensitivity surrounding the Blowout Preventer (BOP), heightened by all the attention gathered after the Macondo event, has established a high level of requirements from regulatory agencies, clients, and drilling contractors themselves. Based on these pillars, the concept of reliability has been constantly applied in the oil industry, especially in the well safety and control system, where it is extremely important for the equipment to be reliable and operational when required. In parallel, the Condition-Based Maintenance (CBM) and Prognostics and Health Management (PHM) concepts, widely used in critical industries that require high reliability levels, are being pointed to as the future of BOP system management. Within this context, the purpose of this paper is to review the literature on Condition-Based Maintenance and Prognostics and Health Management, integrated with reliability concepts, so that they can be applied to BOP health management. The paper identifies the different concepts needed to support the main theme and, through research and selection criteria, brings together a set of publications to build a consistent theoretical framework. This research outlines important techniques used in high-reliability industries and the way they can be applied to the BOP system, and it also provides many useful references and case studies to assist further development work in well control and operational safety
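As a concrete taste of the reliability arithmetic that CBM/PHM approaches build on, here is a minimal sketch using the standard exponential reliability model; the MTBF value and the acceptance threshold are hypothetical, not BOP data:

```python
import math

def reliability(t, mtbf):
    """Exponential reliability model R(t) = exp(-t / MTBF), a standard
    building block in availability analyses (constant failure rate)."""
    return math.exp(-t / mtbf)

def next_inspection(mtbf, r_min):
    """CBM-style rule of thumb: schedule the next inspection before the
    predicted reliability drops below the acceptable level r_min."""
    return -mtbf * math.log(r_min)   # solve R(t) = r_min for t

mtbf_hours = 50_000                  # hypothetical subsea-component MTBF
print(round(reliability(8_760, mtbf_hours), 3))   # reliability after one year
print(round(next_inspection(mtbf_hours, 0.90)))   # hours until R falls to 0.90
```

PHM refines this picture by replacing the fixed MTBF with condition data from sensors, but the scheduling logic, inspect before predicted reliability crosses a threshold, stays the same.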