2,224 research outputs found

    Experimental analysis of computer system dependability

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
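
    The importance-sampling idea mentioned above can be pictured with a minimal sketch (the failure model and all rates below are illustrative assumptions, not taken from the paper): rare failure events are drawn from a failure-biased distribution and each sample is re-weighted by the likelihood ratio, which keeps the estimator unbiased while concentrating samples in the region of interest.

```python
import math
import random

def failure_prob_is(lam=1e-6, t_mission=10.0, lam_is=0.1, n=100_000):
    """Estimate P(T < t_mission) for T ~ Exp(lam) via importance sampling.

    Samples are drawn from a failure-biased Exp(lam_is) proposal and
    re-weighted by the likelihood ratio f(t)/g(t), so the estimator stays
    unbiased while most samples land in the rare-failure region.
    """
    acc = 0.0
    for _ in range(n):
        t = random.expovariate(lam_is)           # biased draw
        if t < t_mission:                        # rare event under Exp(lam)
            w = (lam * math.exp(-lam * t)) / (lam_is * math.exp(-lam_is * t))
            acc += w
    return acc / n

if __name__ == "__main__":
    est = failure_prob_is()
    exact = 1 - math.exp(-1e-6 * 10.0)
    print(f"IS estimate: {est:.3e}   exact: {exact:.3e}")
```

    With crude Monte Carlo, almost no sample would hit the failure region at these rates; the biased proposal makes the same sample budget informative.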

    Study of Single Event Transient Error Mitigation

    Single Event Transient (SET) errors in ground-level electronic devices are a growing concern in the radiation hardening field. However, effective SET mitigation technologies that satisfy ground-level demands, namely being generic, flexible, efficient, and fast, are limited. The classic Triple Modular Redundancy (TMR) method is the most well-known and popular technique in space and nuclear environments, but it incurs more than 200% area and power overheads, which is too costly for ground-level applications. Meanwhile, coding techniques are extensively utilized to inhibit upset errors in storage cells, but the irregularity of combinatorial logic limits their use in SET mitigation. Therefore, SET mitigation techniques suitable for ground-level applications need to be addressed. Aware of these demands, this thesis proposes two novel approaches based on redundant wires and approximate logic. The Redundant Wire technique mitigates SETs by selectively adding redundant wire connections that prevent targeted transient faults from propagating on the fly. This thesis proposes a set of signature-based evaluation equations to efficiently estimate the protecting effect provided by each redundant wire candidate. Based on the estimated results, a greedy algorithm repeatedly inserts the best candidate. Simulation results substantiate that the evaluation equations can achieve up to 98% accuracy on average. Regarding protecting effects, the technique can mask 18.4% of the faults with a 4.3% area, 4.4% power, and 5.4% delay overhead on average. Overall, the quality of the protection results obtained is 2.8 times better than the previous work. Additionally, the impact of synthesis constraints and signature length is discussed. Approximate Logic is a partial TMR technique offering a trade-off between fault coverage and area overhead. The approximate logic consists of an under-approximate logic and an over-approximate logic: the under-approximate logic is a subset of the original min-terms and the over-approximate logic is a subset of the original max-terms. This thesis proposes a new algorithm for generating the two approximate logics. Throughout the generating process, the algorithm considers the intrinsic failure probabilities of each gate and utilizes a confidence interval estimate equation to minimize the required computations. The technique is applied to two fault models, stuck-at and SET, and the separate results are compared and discussed. The results show that the technique can reduce the error by 75% with an area penalty of 46% on some circuits. The delay overhead of this technique is always two additional layers of logic. The two proposed SET mitigation techniques are both applicable to generic combinatorial logic and offer high flexibility. The simulations show promising SET mitigation ability. The proposed mitigation techniques give designers more choices when developing reliable combinatorial logic for ground-level applications.
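
    The greedy insertion loop described above can be sketched as follows; the masking-estimate and overhead functions stand in for the thesis' signature-based evaluation equations and cost models (assumed interfaces, not the original code).

```python
def greedy_insert(candidates, estimate_masking, overhead, budget):
    """Repeatedly insert the redundant-wire candidate with the best estimated
    masking benefit per unit of overhead until the overhead budget runs out.
    estimate_masking(c, chosen) and overhead(c) are placeholders for the
    signature-based evaluation equations and area/power models."""
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(estimate_masking(c, chosen) / max(overhead(c), 1e-9), c)
                  for c in remaining]
        gain, best = max(scored, key=lambda t: t[0])
        if gain <= 0 or spent + overhead(best) > budget:
            break                                 # no useful candidate fits
        chosen.append(best)
        spent += overhead(best)
        remaining.remove(best)
    return chosen

# Toy usage: three candidate wires with made-up masking estimates and costs.
masking = {"w1": 0.12, "w2": 0.05, "w3": 0.02}
print(greedy_insert(masking, lambda c, _: masking[c], lambda c: 0.01, budget=0.02))
# -> ['w1', 'w2']
```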

    Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems

    Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost, high performance computing systems that have become omnipresent in today's technology driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Every application has unique demands, as applications run in diverse operating environments. Dependable, aggressive and adaptive systems improve efficiency in terms of speed, reliability and energy consumption. Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing speculation based reliable overclocking advocates going beyond worst-case limits to achieve best performance while not avoiding, but detecting and correcting, a modest number of timing errors. The success of this design methodology relies on the fact that timing critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance. In this dissertation, we address different aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high performance, energy efficient and dependable systems. We visualize various control knobs in the design that can be favorably controlled to ensure different design targets. As part of this research, we extend the SPRIT3E (Superscalar PeRformance Improvement Through Tolerating Timing Errors) framework, and characterize the extent of application-dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique's adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. Experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, is presented. Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will bring down the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of performing thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves a 25% performance improvement, while lasting for 14 years when operated below 353 K.
Over the past five decades, technology scaling, as predicted by Moore's law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today's omnipresent silicon microchips. Even as the transition to the next technology node is indispensable, the initial cost and time involved in doing so present a non-level playing field for competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy and energy delay product. We present a comprehensive comparison for integer and floating point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies will help skip a technology node altogether, or remain competitive in the market while porting to the next technology node. Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles. Back-annotated post-layout gate-level timing simulations, using 45 nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor, with forwarding logic, show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20% when dynamically overclocked.
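
    The adaptive reliable-overclocking control loop can be pictured with a small sketch; the thresholds, step sizes and interface below are illustrative assumptions, not the SPRIT3E or thermal-management parameters. The clock is raised while the observed timing-error rate stays below a target, and lowered when errors or on-chip temperature climb.

```python
def tune_frequency(freq_ghz, error_rate, temp_k,
                   err_target=0.01, temp_limit=353.0, step=0.05):
    """One step of a simple timing-speculation controller (illustrative).

    Timing errors are assumed to be detected and corrected by the pipeline,
    so a small nonzero error rate is acceptable: the controller trades a
    modest recovery penalty for a higher clock, and throttles on temperature.
    """
    if temp_k >= temp_limit:            # thermal throttling has priority
        return max(freq_ghz - 2 * step, 1.0)
    if error_rate > err_target:         # too many recoveries: back off
        return max(freq_ghz - step, 1.0)
    return freq_ghz + step              # headroom left: overclock further

# Example trace: the frequency tracks what the error/temperature feedback allows.
f = 2.0
for err, temp in [(0.0, 330), (0.002, 340), (0.03, 345), (0.005, 355)]:
    f = tune_frequency(f, err, temp)
    print(round(f, 2))
```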

    A Model-Based Soft Errors Risks Minimization Approach

    Minimizing the risk of system failure in any computer structure requires identifying those components whose failure is likely to impact on system functionality. Clearly, the degree of protection or prevention required against faults is not the same for all components. Tolerance of soft errors can be much improved if critical components can be identified at an early design phase and measures are taken to lower their criticality at that stage. This improvement is achieved by presenting a criticality ranking (among the components), formed by combining a prediction of faults, their consequences, and the propagation of errors at the system modeling phase, and by pointing out ways to apply changes in the model to minimize the risk of degradation of desired functionalities. Case study results are given to validate the approach.
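
    A minimal sketch of how such a criticality ranking might be assembled is shown below; the score formula and the component attributes are illustrative assumptions, not the paper's model.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    fault_rate: float    # predicted faults per unit time (model-level estimate)
    consequence: float   # severity of losing this component, in [0, 1]
    propagation: float   # fraction of the system its errors can reach, in [0, 1]

def criticality_ranking(components):
    """Rank components by a combined risk score (illustrative formula):
    predicted fault rate, weighted by consequence and propagation reach."""
    score = lambda c: c.fault_rate * c.consequence * (1.0 + c.propagation)
    return sorted(components, key=score, reverse=True)

ranking = criticality_ranking([
    Component("bus_arbiter", 2e-6, 0.9, 0.8),
    Component("ui_logger",   5e-6, 0.2, 0.1),
])
print([c.name for c in ranking])   # protection effort goes to the head of the list
```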

    Cross layer reliability estimation for digital systems

    Forthcoming manufacturing technologies hold the promise of increasing the performance and functionality of multifunctional computing systems thanks to a remarkable growth in device integration density. Despite the benefits introduced by these technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor sizes reaching atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and to environmental stress rises dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. Worst-case design with large margins to guarantee reliable operation has been employed for a long time; however, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building "dependable" systems on top of unreliable components, which will degrade and even fail during the normal lifetime of the chip. Conventional design techniques are highly inefficient: they expend a significant amount of energy to tolerate device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional costs introduced to compensate for unreliability are rapidly becoming unacceptable in today's environment, where power consumption is often the limiting factor for integrated circuit performance and energy efficiency is a top concern. Attention should be paid to tailoring techniques that improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions that favor the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanisms are carefully implemented at different layers, starting from the technology level up to the software layer, to optimize the system by exploiting the inner capability of each layer to mask lower-level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains.
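
    The cross-layer idea, that each layer masks part of the faults coming from the layer below, can be sketched with a toy FIT calculation; the layer names and pass-through factors are invented for illustration and the thesis' actual estimation flow is far more detailed.

```python
def system_failure_rate(raw_fit, masking):
    """Propagate a raw technology-level fault rate (in FIT) through per-layer
    pass-through factors, e.g. 0.6 meaning the layer lets 60% of the incoming
    faults manifest at the next layer. Values below are illustrative only."""
    fit = raw_fit
    for layer, pass_through in masking.items():
        fit *= pass_through
    return fit

layers = {"hardware": 0.6, "architecture": 0.4, "software": 0.5}
print(system_failure_rate(1000.0, layers))
# -> 120.0 FIT visible at application level in this toy example
```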

    Machine Learning to Tackle the Challenges of Transient and Soft Errors in Complex Circuits

    The Functional Failure Rate analysis of today's complex circuits is a difficult task and requires a significant investment in terms of human effort, processing resources and tool licenses. De-rating or vulnerability factors are therefore a major instrument of failure analysis efforts. Usually, computationally intensive fault-injection simulation campaigns are required to obtain fine-grained reliability metrics at the functional level. Therefore, the use of machine learning algorithms to assist this procedure, and thus optimise and enhance fault injection efforts, is investigated in this paper. Specifically, machine learning models are used to predict accurate per-instance Functional De-Rating data for the full list of circuit instances, an objective that is difficult to reach using classical methods. The described methodology uses a set of per-instance features, extracted through an analysis approach combining static elements (cell properties, circuit structure, synthesis attributes) and dynamic elements (signal activity). Reference data is obtained through first-principles fault simulation approaches. One part of this reference dataset is used to train the machine learning model and the remainder is used to validate and benchmark the accuracy of the trained tool. The presented methodology is applied to a practical example and various machine learning models are evaluated and compared.
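
    A hedged sketch of that workflow follows, using scikit-learn as a stand-in (the paper does not prescribe a library here, and the per-instance features and reference de-rating values below are synthetic placeholders): a regressor is trained on the fault-simulated subset of instances and then predicts functional de-rating for the rest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Placeholder per-instance features: [fan_out, logic_depth, signal_activity, is_ff]
rng = np.random.default_rng(0)
X = rng.random((2000, 4))
# Placeholder reference de-rating factors "from fault simulation" (synthetic here).
y = 0.5 * X[:, 2] + 0.3 * X[:, 0] + 0.1 * rng.random(2000)

# Train on one part of the reference data, benchmark on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE on held-out instances:", round(mean_absolute_error(y_test, pred), 4))
```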

    A Novel Approach to Minimizing the Risks of Soft Errors in Mobile and Ubiquitous Systems

    A novel approach to minimizing the risks of soft errors at the modelling level of mobile and ubiquitous systems is outlined. From a pure dependability viewpoint, critical components, whose failure is likely to impact on system functionality, attract more attention from protection/prevention mechanisms (against soft errors) than others do. Tolerance of soft errors can be much improved if critical components can be identified at an early design phase and measures are taken to lower their criticality at that stage. This improvement is achieved by presenting a criticality ranking (among the components), formed by combining a prediction of soft errors, their consequences, and the propagation of failures at the system modelling phase, and by pointing out ways to apply changes in the model to minimize the risks of degradation of desired functionalities. Case study results are given to illustrate and validate the approach.
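
    The failure-propagation ingredient of the ranking can be pictured as reachability over a component dependency graph at the modelling level; the graph below and its semantics are invented for illustration and are not the paper's system model.

```python
def propagation_reach(graph, start):
    """Return the set of components an error arising in `start` can reach,
    given a dependency graph {component: [components it feeds]}."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen - {start}

model = {"sensor": ["filter"], "filter": ["controller"],
         "controller": ["actuator", "logger"], "logger": []}
# A soft error in 'sensor' can degrade everything downstream of it:
print(propagation_reach(model, "sensor"))
# -> {'filter', 'controller', 'actuator', 'logger'}
```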

    Fault Modeling of Graphene Nanoribbon FET Logic Circuits

    Due to the increasing defect rates in highly scaled complementary metal-oxide-semiconductor (CMOS) devices, and the emergence of alternative nanotechnology devices, reliability challenges are of growing importance. Understanding and controlling the fault mechanisms associated with new materials and structures for both transistors and interconnections is a key issue in novel nanodevices. The graphene nanoribbon field-effect transistor (GNR FET) has revealed itself as a promising technology for emerging research logic circuits because of its outstanding potential speed and power properties. This work presents a study of fault causes, mechanisms, and models at the device level, as well as their impact on logic circuits based on GNR FETs. From a literature review of fault causes and mechanisms, fault propagation was analyzed, and fault models were derived for the device and logic circuit levels. This study may be helpful for the prevention of faults in the design process of graphene nanodevices. In addition, it can help in the design and evaluation of defect- and fault-tolerant nanoarchitectures based on graphene circuits. Results are compared with other emerging devices, such as carbon nanotube (CNT) FETs and nanowire (NW) FETs.
    This work was supported in part by the Spanish Government under the research project TIN2016-81075-R and by Primeros Proyectos de Investigación (PAID-06-18), Vicerrectorado de Investigación, Innovación y Transferencia de la Universitat Politècnica de València (UPV), under project 200190032.
    Gil Tomás, D. A.; Gracia-Morán, J.; Saiz-Adalid, L.; Gil, P. (2019). Fault Modeling of Graphene Nanoribbon FET Logic Circuits. Electronics, 8(8), 1-18. https://doi.org/10.3390/electronics8080851
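
    To make circuit-level fault models concrete, here is a minimal sketch of evaluating a tiny gate netlist with a stuck-at fault forced on one net; this is a generic logic-level fault-simulation exercise on an invented full-adder netlist, not the paper's GNR FET device-level modelling.

```python
from itertools import product

# Tiny full-adder netlist: each internal net is (gate_type, input_nets).
GATES = {
    "n1":   ("xor", ("a", "b")),
    "sum":  ("xor", ("n1", "cin")),
    "n2":   ("and", ("a", "b")),
    "n3":   ("and", ("n1", "cin")),
    "cout": ("or",  ("n2", "n3")),
}
OPS = {"xor": lambda x, y: x ^ y, "and": lambda x, y: x & y, "or": lambda x, y: x | y}

def evaluate(inputs, stuck=None):
    """Evaluate the netlist; `stuck` = (net, value) forces a stuck-at fault."""
    values = dict(inputs)
    def get(net):
        if stuck and net == stuck[0]:
            return stuck[1]
        if net not in values:
            op, ins = GATES[net]
            values[net] = OPS[op](*(get(i) for i in ins))
        return values[net]
    return get("sum"), get("cout")

# Compare fault-free vs. stuck-at-0 on internal net n1 over all input patterns.
for a, b, cin in product([0, 1], repeat=3):
    good = evaluate({"a": a, "b": b, "cin": cin})
    bad = evaluate({"a": a, "b": b, "cin": cin}, stuck=("n1", 0))
    if good != bad:
        print(f"pattern a={a} b={b} cin={cin} detects n1 stuck-at-0")
```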

    Characterizing the Effects of Intermittent Faults on a Processor for Dependability Enhancement Strategy

    As semiconductor technology scales into the nanometer regime, intermittent faults have become an increasing threat. This paper focuses on the effects of intermittent faults on NET versus REG on the one hand, and the implications for dependability strategy on the other. First, the vulnerability characteristics of representative units in OpenSPARC T2 are revealed, and in particular the highly sensitive modules are identified. Second, an architecture-level dependability enhancement strategy is proposed, showing that events such as core/strand running status and core-memory interface events can be candidate detectable symptoms; a simple watchdog can be deployed to detect application running status (IEXE event), and the resulting SDC (silent data corruption) rate is evaluated, demonstrating the strategy's potential. Third and last, the effects of traditional protection schemes in the target CMT against intermittent faults are quantitatively studied with respect to the contribution of each trap type, demonstrating the necessity of taking this factor into account in the strategy.
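
    The symptom-based detection idea, a simple watchdog observing application running status, can be sketched as follows; the timeout value and the heartbeat interface are illustrative assumptions, not details of the OpenSPARC T2 study.

```python
import threading
import time

class Watchdog:
    """Flags a possible intermittent-fault symptom when the monitored
    application stops reporting progress (e.g. execution heartbeats)
    within a timeout. Illustrative sketch only."""
    def __init__(self, timeout_s=1.0):
        self.timeout_s = timeout_s
        self.last_kick = time.monotonic()
        self.lock = threading.Lock()

    def kick(self):                      # called by the monitored application
        with self.lock:
            self.last_kick = time.monotonic()

    def expired(self):                   # polled by the monitor/recovery logic
        with self.lock:
            return time.monotonic() - self.last_kick > self.timeout_s

wd = Watchdog(timeout_s=0.1)
wd.kick()
time.sleep(0.2)
print("symptom detected:", wd.expired())   # True: no heartbeat within timeout
```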

    Innovative Techniques for Testing and Diagnosing SoCs

    We rely upon the continued functioning of many electronic devices for our everyday welfare, usually embedding integrated circuits that are becoming ever cheaper and smaller with improved features. Nowadays, microelectronics can integrate a working computer with CPU, memories, and even GPUs on a single die, namely a System-on-Chip (SoC). SoCs are also employed in automotive safety-critical applications, but need to be tested thoroughly to comply with reliability standards, in particular the ISO 26262 standard for functional safety of road vehicles. The goal of this PhD thesis is to improve SoC reliability by proposing innovative techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals, and GPUs. The proposed approaches, in the sequence in which they appear in this thesis, are described as follows: 1. Embedded Memory Diagnosis: Memories are dense and complex circuits which are susceptible to design and manufacturing errors. Hence, it is important to understand fault occurrence in the memory array. In practice, the logical and physical array representations differ due to an optimized design which adds enhancements to the device, namely scrambling. This part proposes an accurate memory diagnosis by presenting a software tool able to analyze test results, unscramble the memory array, map failing syndromes to cell locations, elaborate cumulative analyses, and formulate a final fault model hypothesis. Several SRAM memory failing syndromes were analyzed as case studies, gathered on an industrial automotive 32-bit SoC developed by STMicroelectronics. The tool displayed defects virtually, and results were confirmed by real photos taken with a microscope. 2. Functional Test Pattern Generation: The key to a successful test is the pattern applied to the device. Patterns can be structural or functional; the former usually benefit from embedded test modules targeting manufacturing errors and are only effective before shipping the component to the client. The latter, on the other hand, can be applied during mission with minimal impact on performance, but are penalized by high generation time. However, functional test patterns may serve different goals in functional mission mode. Part III of this PhD thesis proposes three different functional test pattern generation methods for CPU cores embedded in SoCs, targeting different test purposes, described as follows: a. Functional Stress Patterns: suitable for optimizing functional stress during Operational-life Tests and Burn-in Screening for an optimal device reliability characterization. b. Functional Power-Hungry Patterns: suitable for determining functional peak power, in order to strictly limit the power of structural patterns during manufacturing tests, thus reducing premature device over-kill while delivering high test coverage. c. Software-Based Self-Test (SBST) Patterns: combine the potential of structural patterns with functional ones, allowing periodic execution during mission. In addition, an external hardware module communicating with a devised SBST was proposed; it helps increase fault coverage by 3% by testing critical Hardly Functionally Testable Faults not covered by conventional SBST patterns. An automatic functional test pattern generation flow exploiting an evolutionary algorithm that maximizes metrics related to stress, power, and fault coverage was employed in the above-mentioned approaches to quickly generate the desired patterns.
The approaches were evaluated on two industrial cases developed by STMicroelectronics: an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation time was reduced by up to 75% in comparison to older methodologies, while significantly increasing the desired metrics. 3. Fault Injection in GPGPUs: Fault injection mechanisms in semiconductor devices are suitable for generating structural patterns, testing and activating mitigation techniques, and validating robust hardware and software applications. GPGPUs are known for fast parallel computation and are used in high performance computing and advanced driver assistance, where reliability is the key point. Moreover, GPGPU manufacturers do not provide design description code due to content secrecy; therefore, commercial fault injectors based on a GPGPU model are unfeasible, making radiation tests the only available resource, and these are costly. In the last part of this thesis, we propose a software-implemented fault injector able to inject bit-flips into memory elements of a real GPGPU. It exploits a software debugger tool combined with the C-CUDA grammar to wisely determine fault spots and apply bit-flip operations to program variables. The goal is to validate robust parallel algorithms by studying fault propagation or activating redundancy mechanisms they may embed. The effectiveness of the tool was evaluated on two robust applications: redundant parallel matrix multiplication and floating point Fast Fourier Transform.
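
    The core operation of such a software-implemented fault injector, flipping one bit of a program variable, can be sketched in plain Python; the real tool drives a debugger over C-CUDA code, so this only illustrates the bit-flip mechanics under that assumption.

```python
import random
import struct

def flip_bit_int(value, bit=None, width=32):
    """Flip one (random by default) bit of an integer variable."""
    bit = random.randrange(width) if bit is None else bit
    return value ^ (1 << bit)

def flip_bit_float(value, bit=None):
    """Flip one bit of the IEEE-754 single-precision encoding of a float,
    mimicking a bit-flip in a 32-bit register or memory word."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits = flip_bit_int(bits, bit, width=32)
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    return corrupted

print(flip_bit_int(0b1010, bit=0))     # 11: least-significant bit flipped
print(flip_bit_float(1.0, bit=25))     # 0.0625: an exponent bit flipped
```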