Search CORE

225 research outputs found

Injecting Intermittent Faults for the Dependability Assessment of a Fault-Tolerant Microcomputer System

Author: Baraza Calvo Juan Carlos
Gil Tomás Daniel Antonio
Gil Vicente Pedro Joaquín
Gracia Morán Joaquín
Saiz Adalid Luis José
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2016
Field of study

© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.As scaling is more and more aggressive, intermittent faults are increasing their importance in current deep submicron complementary metal-oxide-semiconductor (CMOS) technologies. This work shows the dependability assessment of a fault-tol- erant computer system against intermittent faults. The applied methodology lies in VHDL-based fault injection, which allows the assessment in early design phases, together with a high level of observability and controllability. The evaluated system is a duplex microcontroller system with cold stand-by sparing. A wide set of intermittent fault models have been injected, and from the simulation traces, coverages and latencies have been measured. Markov models for this system have been generated and some dependability functions, such as reliability and safety, have been calculated. From these results, some enhancements of detection and recovery mechanisms have been suggested. The methodology presented is general to any fault-tolerant computer system.This work was supported in part by the Universitat Politecnica de Valencia under the Research Project SP20120806, and in part by the Spanish Government under the Research Project TIN2012-38308-C02-01. Associate Editor: J. Shortle.Gil Tomás, DA.; Gracia Morán, J.; Baraza Calvo, JC.; Saiz Adalid, LJ.; Gil Vicente, PJ. (2016). Injecting Intermittent Faults for the Dependability Assessment of a Fault-Tolerant Microcomputer System. IEEE Transactions on Reliability. 65(2):648-661. https://doi.org/10.1109/TR.2015.2484058S64866165

RiuNet

Effects of intermittent faults on the reliability of a Reduced Instruction Set Computing (RISC) microprocessor

Author: Baraza Calvo Juan Carlos
Gil Tomás Daniel Antonio
Gil Pedro
Gracia-Morán Joaquín
Saiz-Adalid Luis-J.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 24/01/2014
Field of study

© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.With the scaling of complementary metal-oxide-semiconductor (CMOS) technology to the submicron range, designers have to deal with a growing number and variety of fault types. In this way, intermittent faults are gaining importance in modern very large scale integration (VLSI) circuits. The presence of these faults is increasing due to the complexity of manufacturing processes (which produce residues and parameter variations), together with special aging mechanisms. This work presents a case study of the impact of intermittent faults on the behavior of a reduced instruction set computing (RISC) microprocessor. We have carried out an exhaustive reliability assessment by using very-high-speed-integrated-circuit hardware description language (VHDL)-based fault injection. In this way, we have been able to modify different intermittent fault parameters, to select various targets, and even, to compare the impact of intermittent faults with those induced by transient and permanent faults.This work was supported by the Spanish Government under the Research Project TIN2009-13825 and by the Universitat Politecnica de Valencia under the Project SP20120806. Associate Editor: L. Cui.Gracia-Morán, J.; Baraza Calvo, JC.; Gil Tomás, DA.; Saiz-Adalid, L.; Gil, P. (2014). Effects of intermittent faults on the reliability of a Reduced Instruction Set Computing (RISC) microprocessor. IEEE Transactions on Reliability. 63(1):144-153. https://doi.org/10.1109/TR.2014.2299711S14415363

RiuNet

The Challenge of Detection and Diagnosis of Fugacious Hardware Faults in VLSI Designs

Author: A. Avizienis
A. Bondavalli
A. Correcher
C. Bolchini
C. Constantinescu
E.B. Nightingale
J. Savir
J. Sosnowski
K. Kimseng
M. Goessel
M. Nicolaidis
N.A. Touba
O. Hannius
P.E. Dodd
S. Borkar
S. D’Angelo
S.B. Ko
V. Narayanan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-38789-0_7Current integration scales are increasing the number and types of faults that embedded systems must face. Traditional approaches focus on dealing with those transient and permanent faults that impact the state or output of systems, whereas little research has targeted those faults being logically, electrically or temporally masked -which we have named fugacious. A fast detection and precise diagnosis of faults occurrence, even if the provided service is unaffected, could be of invaluable help to determine, for instance, that systems are currently under the influence of environmental disturbances like radiation, suffering from wear-out, or being affected by an intermittent fault. Upon detection, systems may react to adapt the deployed fault tolerance mechanisms to the diagnosed problem. This paper explores these ideas evaluating challenges and requirements involved, and provides an outline of potential techniques to be applied.This work has been funded by Spanish Ministry of Economy ARENES project (TIN2012-38308-C02-01)Espinosa García, J.; Andrés Martínez, DD.; Ruiz, JC.; Gil, P. (2013). The Challenge of Detection and Diagnosis of Fugacious Hardware Faults in VLSI Designs. En Dependable Computing. Springer. 76-87. https://doi.org/10.1007/978-3-642-38789-0_7S7687Narayanan, V., Xie, Y.: Reliability concerns in embedded systems design. IEEE Computer 1(39), 118–120 (2006)Hannius, O., Karlsson, J.: Impact of soft errors in a jet engine controller. In: Ortmeier, F., Daniel, P. (eds.) SAFECOMP 2012. LNCS, vol. 7612, pp. 223–234. Springer, Heidelberg (2012)Borkar, S.: Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6), 10–16 (2005)JEDEC: Measurement and reporting of alpha particle and terrestrial cosmic ray-induced soft errors in semiconductor devices. JEDEC Standard JESD89A. JEDEC (2006)Gracia-Moran, J., Gil-Tomas, D., Saiz-Adalid, L.J., Baraza, J.C., Gil-Vicente, P.J.: Experimental validation of a fault tolerant microcomputer system against intermittent faults. In: DSN, pp. 413–418 (2010)Constantinescu, C.: Intermittent faults and effects on reliability of integrated circuits. In: Proceedings of the 2008 Annual Reliability and Maintainability Symposium, pp. 370–374. IEEE Computer Society, Washington, DC (2008)Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secur. Comput. 1, 11–33 (2004)Johnson, C., Holloway, C.: The dangers of failure masking in fault-tolerant software: Aspects of a recent in-flight upset event. In: 2007 2nd Institution of Engineering and Technology International Conference on System Safety, pp. 60–65 (October 2007)Bolchini, C., Salice, F., Sciuto, D.: Fault analysis for networks with concurrent error detection. IEEE Des. Test 15(4), 66–74 (1998)Goessel, M., Ocheretny, V., Sogomonyan, E., Marienfeld, D.: New Methods of Concurrent Checking (Frontiers in Electronic Testing), 1st edn. Springer Publishing Company, Incorporated (2008)Iyer, R.K., Rossetti, D.J.: A statistical load dependency model for cpu errors at slac. In: Twenty-Fifth International Symposium on Fault-Tolerant Computing, ‘Highlights from Twenty-Five Years’, p. 373 (June 1995)Dodd, P.E., Shaneyfelt, M.R., Felix, J.A., Schwank, J.R.: Production and propagation of single-event transients in high-speed digital logic ics. IEEE Transactions on Nuclear Science 51, 3278–3284 (2004)Nightingale, E.B., Douceur, J.R., Orgovan, V.: Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer pcs. In: Proceedings of the Sixth Conference on Computer Systems, EuroSys 2011, pp. 343–356. ACM, New York (2011)Kimseng, K., Hoit, M., Tiwari, N., Pecht, M.: Physics-of-failure assessment of a cruise control module. Microelectronics Reliability 39(10), 1423–1444 (1999)Savir, J.: Detection of single intermittent faults in sequential circuits. IEEE Trans. Comput. 29(7), 673–678 (1980)Correcher, A., Garcia, E., Morant, F., Quiles, E., Rodriguez, L.: Intermittent failure dynamics characterization. IEEE Transactions on Reliability 61(3), 649–658 (2012)Sorensen, B., Kelly, G., Sajecki, A., Sorensen, P.: An analyzer for detecting intermittent faults in electronic devices. In: AUTOTESTCON 1994. IEEE Systems Readiness Technology Conference. ‘Cost Effective Support Into the Next Century’, Conference Proceedings, pp. 417–421 (September 1994)Sosnowski, J.: Transient fault tolerance in digital systems. IEEE Micro 14(1), 24–35 (1994)Bondavalli, A., Chiaradonna, S., Di Giandomenico, F., Grandoni, F.: Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Trans. Comput. 49(3), 230–245 (2000)Rashid, L., Pattabiraman, K., Gopalakrishnan, S.: Intermittent hardware errors and recovery: modelling and evaluation. In: International Conference on Quantitative Evaluation of Systems, QEST (2012)Touba, N.A., McCluskey, E.J.: Logic synthesis of multilevel circuits with concurrent error detection. IEEE Trans. CAD 16(7), 783–789 (1997)Nicolaidis, M., Manich, S., Figueras, J.: Achieving fault secureness in parity prediction arithmetic operators: General conditions and implementations. In: Proceedings of the 1996 European conference on Design and Test, EDTC 1996, pp. 186–193. IEEE Computer Society, Washington, DC (1996)Ko, S.B., Lo, J.C.: Efficient realization of parity prediction functions in fpgas. J. Electron. Test. 20(5), 489–499 (2004)D’Angelo, S., Sechi, G.R., Metra, C.: Transient and permanent fault diagnosis for fpga-based tmr systems. In: Proceedings of the 14th International Symposium on Defect and Fault-Tolerance in VLSI Systems, DFT 1999, pp. 330–338. IEEE Computer Society, Washington, DC (1999)Kim, C.: Detection and location of intermittent faults by monitoring carrier signal channel behavior of electrical interconnection system. In: Electric Ship Technologies Symposium, ESTS 2009, pp. 449–455. IEEE (April 2009

RiuNet

Integrated analysis of error detection and recovery

Author: Lee Y. H.
Shin K. G.
Publication venue
Publication date
Field of study

An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms

Characterizing the Effects of Intermittent Faults on a Processor for Dependability Enhancement Strategy

Author: Chao(Saul) Wang
Dong-Sheng Wang
Hong-Song Chen
Zhong-Chuan Fu
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014
Field of study

As semiconductor technology scales into the nanometer regime, intermittent faults have become an increasing threat. This paper focuses on the effects of intermittent faults on NET versus REG on one hand and the implications for dependability strategy on the other. First, the vulnerability characteristics of representative units in OpenSPARC T2 are revealed, and in particular, the highly sensitive modules are identified. Second, an arch-level dependability enhancement strategy is proposed, showing that events such as core/strand running status and core-memory interface events can be candidates of detectable symptoms. A simple watchdog can be deployed to detect application running status (IEXE event). Then SDC (silent data corruption) rate is evaluated demonstrating its potential. Third and last, the effects of traditional protection schemes in the target CMT to intermittent faults are quantitatively studied on behalf of the contribution of each trap type, demonstrating the necessity of taking this factor into account for the strategy

Directory of Open Access Journals

Experimental analysis of computer system dependability

Author: Iyer Ravishankar, K.
Tang Dong
Publication venue
Publication date
Field of study

This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

Report of the IEEE Workshop on Measurement and Modeling of Computer Dependability

Author: Iyer Ravishankar K.
Siewiorek Daniel P.
Publication venue: Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Publication date: 01/09/1990
Field of study

Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNASA Langley Research Center / NASA NAG-1-602 and NASA NAG-1-613ONR / N00014-85-K-000

Proposal of an Adaptive Fault Tolerance Mechanism to Tolerate Intermittent Faults in RAM

Author: Baraza Calvo Juan Carlos
Gil Tomás Daniel Antonio
Gil Pedro
Gracia-Morán Joaquín
Saiz-Adalid Luis-J.
Publication venue: 'MDPI AG'
Publication date: 01/12/2020
Field of study

[EN] Due to transistor shrinking, intermittent faults are a major concern in current digital systems. This work presents an adaptive fault tolerance mechanism based on error correction codes (ECC), able to modify its behavior when the error conditions change without increasing the redundancy. As a case example, we have designed a mechanism that can detect intermittent faults and swap from an initial generic ECC to a specific ECC capable of tolerating one intermittent fault. We have inserted the mechanism in the memory system of a 32-bit RISC processor and validated it by using VHDL simulation-based fault injection. We have used two (39, 32) codes: a single error correction-double error detection (SEC-DED) and a code developed by our research group, called EPB3932, capable of correcting single errors and double and triple adjacent errors that include a bit previously tagged as error-prone. The results of injecting transient, intermittent, and combinations of intermittent and transient faults show that the proposed mechanism works properly. As an example, the percentage of failures and latent errors is 0% when injecting a triple adjacent fault after an intermittent stuck-at fault. We have synthesized the adaptive fault tolerance mechanism proposed in two types of FPGAs: non-reconfigurable and partially reconfigurable. In both cases, the overhead introduced is affordable in terms of hardware, time and power consumption.This research was supported in part by the Spanish Government, project TIN2016-81,075-R, and by Primeros Proyectos de Investigacion (PAID-06-18), Vicerrectorado de Investigacion, Innovacion y Transferencia de la Universitat Politecnica de Valencia (UPV), project 20190032.Baraza Calvo, JC.; Gracia-Morán, J.; Saiz-Adalid, L.; Gil Tomás, DA.; Gil, P. (2020). Proposal of an Adaptive Fault Tolerance Mechanism to Tolerate Intermittent Faults in RAM. Electronics. 9(12):1-30. https://doi.org/10.3390/electronics9122074S130912International Technology Roadmap for Semiconductors (ITRS)http://www.itrs2.net/2013-itrs.htmlJeng, S.-L., Lu, J.-C., & Wang, K. (2007). A Review of Reliability Research on Nanotechnology. IEEE Transactions on Reliability, 56(3), 401-410. doi:10.1109/tr.2007.903188Ibe, E., Taniguchi, H., Yahagi, Y., Shimbo, K., & Toba, T. (2010). Impact of Scaling on Neutron-Induced Soft Error in SRAMs From a 250 nm to a 22 nm Design Rule. IEEE Transactions on Electron Devices, 57(7), 1527-1538. doi:10.1109/ted.2010.2047907Boussif, A., Ghazel, M., & Basilio, J. C. (2020). Intermittent fault diagnosability of discrete event systems: an overview of automaton-based approaches. Discrete Event Dynamic Systems, 31(1), 59-102. doi:10.1007/s10626-020-00324-yConstantinescu, C. (2003). Trends and challenges in VLSI circuit reliability. IEEE Micro, 23(4), 14-19. doi:10.1109/mm.2003.1225959Bondavalli, A., Chiaradonna, S., Di Giandomenico, F., & Grandoni, F. (2000). Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Transactions on Computers, 49(3), 230-245. doi:10.1109/12.841127Contant, O., Lafortune, S., & Teneketzis, D. (2004). Diagnosis of Intermittent Faults. Discrete Event Dynamic Systems, 14(2), 171-202. doi:10.1023/b:disc.0000018570.20941.d2Sorensen, B. A., Kelly, G., Sajecki, A., & Sorensen, P. W. (s. f.). An analyzer for detecting intermittent faults in electronic devices. Proceedings of AUTOTESTCON ’94. doi:10.1109/autest.1994.381590Gracia-Moran, J., Gil-Tomas, D., Saiz-Adalid, L. J., Baraza, J. C., & Gil-Vicente, P. J. (2010). Experimental validation of a fault tolerant microcomputer system against intermittent faults. 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). doi:10.1109/dsn.2010.5544288Fujiwara, E. (2005). Code Design for Dependable Systems. doi:10.1002/0471792748Hamming, R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2), 147-160. doi:10.1002/j.1538-7305.1950.tb00463.xSaiz-Adalid, L.-J., Gil-Vicente, P.-J., Ruiz-García, J.-C., Gil-Tomás, D., Baraza, J.-C., & Gracia-Morán, J. (2013). Flexible Unequal Error Control Codes with Selectable Error Detection and Correction Levels. Computer Safety, Reliability, and Security, 178-189. doi:10.1007/978-3-642-40793-2_17Frei, R., McWilliam, R., Derrick, B., Purvis, A., Tiwari, A., & Di Marzo Serugendo, G. (2013). Self-healing and self-repairing technologies. The International Journal of Advanced Manufacturing Technology, 69(5-8), 1033-1061. doi:10.1007/s00170-013-5070-2Maiz, J., Hareland, S., Zhang, K., & Armstrong, P. (s. f.). Characterization of multi-bit soft error events in advanced SRAMs. IEEE International Electron Devices Meeting 2003. doi:10.1109/iedm.2003.1269335Schroeder, B., Pinheiro, E., & Weber, W.-D. (2011). DRAM errors in the wild. Communications of the ACM, 54(2), 100-107. doi:10.1145/1897816.1897844BanaiyanMofrad, A., Ebrahimi, M., Oboril, F., Tahoori, M. B., & Dutt, N. (2015). Protecting caches against multi-bit errors using embedded erasure coding. 2015 20th IEEE European Test Symposium (ETS). doi:10.1109/ets.2015.7138735Kim, J., Sullivan, M., Lym, S., & Erez, M. (2016). All-Inclusive ECC: Thorough End-to-End Protection for Reliable Computer Memory. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). doi:10.1109/isca.2016.60Hwang, A. A., Stefanovici, I. A., & Schroeder, B. (2012). Cosmic rays don’t strike twice. ACM SIGPLAN Notices, 47(4), 111-122. doi:10.1145/2248487.2150989Gil-Tomás, D., Gracia-Morán, J., Baraza-Calvo, J.-C., Saiz-Adalid, L.-J., & Gil-Vicente, P.-J. (2012). Studying the effects of intermittent faults on a microcontroller. Microelectronics Reliability, 52(11), 2837-2846. doi:10.1016/j.microrel.2012.06.004Plasma CPU Modelhttps://opencores.org/projects/plasmaArlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J.-C., Laprie, J.-C., … Powell, D. (1990). Fault injection for dependability validation: a methodology and some applications. IEEE Transactions on Software Engineering, 16(2), 166-182. doi:10.1109/32.44380Gil-Tomas, D., Gracia-Moran, J., Baraza-Calvo, J.-C., Saiz-Adalid, L.-J., & Gil-Vicente, P.-J. (2012). Analyzing the Impact of Intermittent Faults on Microprocessors Applying Fault Injection. IEEE Design & Test of Computers, 29(6), 66-73. doi:10.1109/mdt.2011.2179514Rashid, L., Pattabiraman, K., & Gopalakrishnan, S. (2010). Modeling the Propagation of Intermittent Hardware Faults in Programs. 2010 IEEE 16th Pacific Rim International Symposium on Dependable Computing. doi:10.1109/prdc.2010.52Amiri, M., Siddiqui, F. M., Kelly, C., Woods, R., Rafferty, K., & Bardak, B. (2016). FPGA-Based Soft-Core Processors for Image Processing Applications. Journal of Signal Processing Systems, 87(1), 139-156. doi:10.1007/s11265-016-1185-7Hailesellasie, M., Hasan, S. R., & Mohamed, O. A. (2019). MulMapper: Towards an Automated FPGA-Based CNN Processor Generator Based on a Dynamic Design Space Exploration. 2019 IEEE International Symposium on Circuits and Systems (ISCAS). doi:10.1109/iscas.2019.8702589Mittal, S. (2018). A survey of FPGA-based accelerators for convolutional neural networks. Neural Computing and Applications, 32(4), 1109-1139. doi:10.1007/s00521-018-3761-1Intel Completes Acquisition of Alterahttps://newsroom.intel.com/news-releases/intel-completes-acquisition-of-altera/#gs.mi6ujuAMD to Acquire Xilinx, Creating the Industry’s High Performance Computing Leaderhttps://www.amd.com/en/press-releases/2020-10-27-amd-to-acquire-xilinx-creating-the-industry-s-high-performance-computingKim, K. H., & Lawrence, T. F. (s. f.). Adaptive fault tolerance: issues and approaches. [1990] Proceedings. Second IEEE Workshop on Future Trends of Distributed Computing Systems. doi:10.1109/ftdcs.1990.138292Gonzalez, O., Shrikumar, H., Stankovic, J. A., & Ramamritham, K. (s. f.). Adaptive fault tolerance and graceful degradation under dynamic hard real-time scheduling. Proceedings Real-Time Systems Symposium. doi:10.1109/real.1997.641271Jacobs, A., George, A. D., & Cieslewski, G. (2009). Reconfigurable fault tolerance: A framework for environmentally adaptive fault mitigation in space. 2009 International Conference on Field Programmable Logic and Applications. doi:10.1109/fpl.2009.5272313Shin, D., Park, J., Park, J., Paul, S., & Bhunia, S. (2017). Adaptive ECC for Tailored Protection of Nanoscale Memory. IEEE Design & Test, 34(6), 84-93. doi:10.1109/mdat.2016.2615844Silva, F., Muniz, A., Silveira, J., & Marcon, C. (2020). CLC-A: An Adaptive Implementation of the Column Line Code (CLC) ECC. 2020 33rd Symposium on Integrated Circuits and Systems Design (SBCCI). doi:10.1109/sbcci50935.2020.9189901Mukherjee, S. S., Emer, J., Fossum, T., & Reinhardt, S. K. (s. f.). Cache scrubbing in microprocessors: myth or necessity? 10th IEEE Pacific Rim International Symposium on Dependable Computing, 2004. Proceedings. doi:10.1109/prdc.2004.1276550Saleh, A. M., Serrano, J. J., & Patel, J. H. (1990). Reliability of scrubbing recovery-techniques for memory systems. IEEE Transactions on Reliability, 39(1), 114-122. doi:10.1109/24.52622X9SRA User’s Manual (Rev. 1.1)https://www.manualshelf.com/manual/supermicro/x9sra/user-s-manual-1-1.htmlChishti, Z., Alameldeen, A. R., Wilkerson, C., Wu, W., & Lu, S.-L. (2009). Improving cache lifetime reliability at ultra-low voltages. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture - Micro-42. doi:10.1145/1669112.1669126Datta, R., & Touba, N. A. (2011). Designing a fast and adaptive error correction scheme for increasing the lifetime of phase change memories. 29th VLSI Test Symposium. doi:10.1109/vts.2011.5783773Kim, J., Lim, J., Cho, W., Shin, K.-S., Kim, H., & Lee, H.-J. (2016). Adaptive Memory Controller for High-performance Multi-channel Memory. JSTS:Journal of Semiconductor Technology and Science, 16(6), 808-816. doi:10.5573/jsts.2016.16.6.808Yuan, L., Liu, H., Jia, P., & Yang, Y. (2015). Reliability-Based ECC System for Adaptive Protection of NAND Flash Memories. 2015 Fifth International Conference on Communication Systems and Network Technologies. doi:10.1109/csnt.2015.23Zhou, Y., Wu, F., Lu, Z., He, X., Huang, P., & Xie, C. (2019). SCORE. ACM Transactions on Architecture and Code Optimization, 15(4), 1-25. doi:10.1145/3291052Lu, S.-K., Li, H.-P., & Miyase, K. (2018). Adaptive ECC Techniques for Reliability and Yield Enhancement of Phase Change Memory. 2018 IEEE 24th International Symposium on On-Line Testing And Robust System Design (IOLTS). doi:10.1109/iolts.2018.8474118Chen, J., Andjelkovic, M., Simevski, A., Li, Y., Skoncej, P., & Krstic, M. (2019). Design of SRAM-Based Low-Cost SEU Monitor for Self-Adaptive Multiprocessing Systems. 2019 22nd Euromicro Conference on Digital System Design (DSD). doi:10.1109/dsd.2019.00080Wang, X., Jiang, L., & Chakrabarty, K. (2020). LSTM-based Analysis of Temporally- and Spatially-Correlated Signatures for Intermittent Fault Detection. 2020 IEEE 38th VLSI Test Symposium (VTS). doi:10.1109/vts48691.2020.9107600Ebrahimi, H., & G. Kerkhoff, H. (2018). Intermittent Resistance Fault Detection at Board Level. 2018 IEEE 21st International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS). doi:10.1109/ddecs.2018.00031Ebrahimi, H., & Kerkhoff, H. G. (2020). A New Monitor Insertion Algorithm for Intermittent Fault Detection. 2020 IEEE European Test Symposium (ETS). doi:10.1109/ets48528.2020.9131563Hsiao, M. Y. (1970). A Class of Optimal Minimum Odd-weight-column SEC-DED Codes. IBM Journal of Research and Development, 14(4), 395-401. doi:10.1147/rd.144.0395Benso, A., & Prinetto, P. (Eds.). (2004). Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation. Frontiers in Electronic Testing. doi:10.1007/b105828Gracia, J., Saiz, L. J., Baraza, J. C., Gil, D., & Gil, P. J. (2008). Analysis of the influence of intermittent faults in a microcontroller. 2008 11th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems. doi:10.1109/ddecs.2008.4538761ZC702 Evaluation Board for the Zynq-7000 XC7Z020 SoChttps://www.xilinx.com/support/documentation/boards_and_kits/zc702_zvik/ug850-zc702-eval-bd.pd

RiuNet

Simulation-based Fault Injection with QEMU for Speeding-up Dependability Analysis of Embedded Software

Author: Ferraretto Davide
Pravadelli Graziano
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

Simulation-based fault injection (SFI) represents a valuable solu- tion for early analysis of software dependability and fault tolerance properties before the physical prototype of the target platform is available. Some SFI approaches base the fault injection strategy on cycle-accurate models imple- mented by means of Hardware Description Languages (HDLs). However, cycle- accurate simulation has revealed to be too time-consuming when the objective is to emulate the effect of soft errors on complex microprocessors. To overcome this issue, SFI solutions based on virtual prototypes of the target platform has started to be proposed. However, current approaches still present some draw- backs, like, for example, they work only for specific CPU architectures, or they require code instrumentation, or they have a different target (i.e., design errors instead of dependability analysis). To address these disadvantages, this paper presents an efficient fault injection approach based on QEMU, one of the most efficient and popular instruction-accurate emulator for several microprocessor architectures. As main goal, the proposed approach represents a non intrusive technique for simulating hardware faults affecting CPU behaviours. Perma- nent and transient/intermittent hardware fault models have been abstracted without losing quality for software dependability analysis. The approach mini- mizes the impact of the fault injection procedure in the emulator performance by preserving the original dynamic binary translation mechanism of QEMU. Experimental results for both x86 and ARM processors proving the efficiency and effectiveness of the proposed approach are presented

Fault injection testing of software implemented fault tolerance mechanisms of distributed systems

Author: Tao Sha
Publication venue: Newcastle University
Publication date: 01/01/1996
Field of study

PhD ThesisOne way of gaining confidence in the adequacy of fault tolerance mechanisms of a system is to test the system by injecting faults and see how the system performs under faulty conditions. This thesis investigates the issues of testing software-implemented fault tolerance mechanisms of distributed systems through fault injection. A fault injection method has been developed. The method requires that the target software system be structured as a collection of objects interacting via messages. This enables easy insertion of fault injection objects into the target system to emulate incorrect behaviour of faulty processors by manipulating messages. This approach allows one to inject specific classes of faults while not requiring any significant changes to the target system. The method differs from the previous work in that it exploits an object oriented approach of software implementation to support the injection of specific classes of faults at the system level. The proposed fault injection method has been applied to test software-implemented reliable node systems: a TMR (triple modular redundant) node and a fail-silent node. The nodes have integrated fault tolerance mechanisms and are expected to exhibit certain behaviour in the presence of a failure. The thesis describes how various such mechanisms (for example, clock synchronisation protocol, and atomic broadcast protocol) were tested. The testing revealed flaws in implementation that had not been discovered before, thereby demonstrating the usefulness of the method. Application of the approach to other distributed systems is also described in the thesis.CEC ESPRIT programme, UK Engineering and Physical Sciences Research Council (EPSRC)

Newcastle University eTheses