2,945 research outputs found

    Experimental analysis of computer system dependability

    Get PDF
    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software-dependability, and fault diagnosis. The discussion involves several important issues studies in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance

    New Fault Detection, Mitigation and Injection Strategies for Current and Forthcoming Challenges of HW Embedded Designs

    Full text link
    Tesis por compendio[EN] Relevance of electronics towards safety of common devices has only been growing, as an ever growing stake of the functionality is assigned to them. But of course, this comes along the constant need for higher performances to fulfill such functionality requirements, while keeping power and budget low. In this scenario, industry is struggling to provide a technology which meets all the performance, power and price specifications, at the cost of an increased vulnerability to several types of known faults or the appearance of new ones. To provide a solution for the new and growing faults in the systems, designers have been using traditional techniques from safety-critical applications, which offer in general suboptimal results. In fact, modern embedded architectures offer the possibility of optimizing the dependability properties by enabling the interaction of hardware, firmware and software levels in the process. However, that point is not yet successfully achieved. Advances in every level towards that direction are much needed if flexible, robust, resilient and cost effective fault tolerance is desired. The work presented here focuses on the hardware level, with the background consideration of a potential integration into a holistic approach. The efforts in this thesis have focused several issues: (i) to introduce additional fault models as required for adequate representativity of physical effects blooming in modern manufacturing technologies, (ii) to provide tools and methods to efficiently inject both the proposed models and classical ones, (iii) to analyze the optimum method for assessing the robustness of the systems by using extensive fault injection and later correlation with higher level layers in an effort to cut development time and cost, (iv) to provide new detection methodologies to cope with challenges modeled by proposed fault models, (v) to propose mitigation strategies focused towards tackling such new threat scenarios and (vi) to devise an automated methodology for the deployment of many fault tolerance mechanisms in a systematic robust way. The outcomes of the thesis constitute a suite of tools and methods to help the designer of critical systems in his task to develop robust, validated, and on-time designs tailored to his application.[ES] La relevancia que la electrónica adquiere en la seguridad de los productos ha crecido inexorablemente, puesto que cada vez ésta copa una mayor influencia en la funcionalidad de los mismos. Pero, por supuesto, este hecho viene acompañado de una necesidad constante de mayores prestaciones para cumplir con los requerimientos funcionales, al tiempo que se mantienen los costes y el consumo en unos niveles reducidos. En este escenario, la industria está realizando esfuerzos para proveer una tecnología que cumpla con todas las especificaciones de potencia, consumo y precio, a costa de un incremento en la vulnerabilidad a múltiples tipos de fallos conocidos o la introducción de nuevos. Para ofrecer una solución a los fallos nuevos y crecientes en los sistemas, los diseñadores han recurrido a técnicas tradicionalmente asociadas a sistemas críticos para la seguridad, que ofrecen en general resultados sub-óptimos. De hecho, las arquitecturas empotradas modernas ofrecen la posibilidad de optimizar las propiedades de confiabilidad al habilitar la interacción de los niveles de hardware, firmware y software en el proceso. No obstante, ese punto no está resulto todavía. Se necesitan avances en todos los niveles en la mencionada dirección para poder alcanzar los objetivos de una tolerancia a fallos flexible, robusta, resiliente y a bajo coste. El trabajo presentado aquí se centra en el nivel de hardware, con la consideración de fondo de una potencial integración en una estrategia holística. Los esfuerzos de esta tesis se han centrado en los siguientes aspectos: (i) la introducción de modelos de fallo adicionales requeridos para la representación adecuada de efectos físicos surgentes en las tecnologías de manufactura actuales, (ii) la provisión de herramientas y métodos para la inyección eficiente de los modelos propuestos y de los clásicos, (iii) el análisis del método óptimo para estudiar la robustez de sistemas mediante el uso de inyección de fallos extensiva, y la posterior correlación con capas de más alto nivel en un esfuerzo por recortar el tiempo y coste de desarrollo, (iv) la provisión de nuevos métodos de detección para cubrir los retos planteados por los modelos de fallo propuestos, (v) la propuesta de estrategias de mitigación enfocadas hacia el tratamiento de dichos escenarios de amenaza y (vi) la introducción de una metodología automatizada de despliegue de diversos mecanismos de tolerancia a fallos de forma robusta y sistemática. Los resultados de la presente tesis constituyen un conjunto de herramientas y métodos para ayudar al diseñador de sistemas críticos en su tarea de desarrollo de diseños robustos, validados y en tiempo adaptados a su aplicación.[CA] La rellevància que l'electrònica adquireix en la seguretat dels productes ha crescut inexorablement, puix cada volta més aquesta abasta una major influència en la funcionalitat dels mateixos. Però, per descomptat, aquest fet ve acompanyat d'un constant necessitat de majors prestacions per acomplir els requeriments funcionals, mentre es mantenen els costos i consums en uns nivells reduïts. Donat aquest escenari, la indústria està fent esforços per proveir una tecnologia que complisca amb totes les especificacions de potència, consum i preu, tot a costa d'un increment en la vulnerabilitat a diversos tipus de fallades conegudes, i a la introducció de nous tipus. Per oferir una solució a les noves i creixents fallades als sistemes, els dissenyadors han recorregut a tècniques tradicionalment associades a sistemes crítics per a la seguretat, que en general oferixen resultats sub-òptims. De fet, les arquitectures empotrades modernes oferixen la possibilitat d'optimitzar les propietats de confiabilitat en habilitar la interacció dels nivells de hardware, firmware i software en el procés. Tot i això eixe punt no està resolt encara. Es necessiten avanços a tots els nivells en l'esmentada direcció per poder assolir els objectius d'una tolerància a fallades flexible, robusta, resilient i a baix cost. El treball ací presentat se centra en el nivell de hardware, amb la consideració de fons d'una potencial integració en una estratègia holística. Els esforços d'esta tesi s'han centrat en els següents aspectes: (i) la introducció de models de fallada addicionals requerits per a la representació adequada d'efectes físics que apareixen en les tecnologies de fabricació actuals, (ii) la provisió de ferramentes i mètodes per a la injecció eficient del models proposats i dels clàssics, (iii) l'anàlisi del mètode òptim per estudiar la robustesa de sistemes mitjançant l'ús d'injecció de fallades extensiva, i la posterior correlació amb capes de més alt nivell en un esforç per retallar el temps i cost de desenvolupament, (iv) la provisió de nous mètodes de detecció per cobrir els reptes plantejats pels models de fallades proposats, (v) la proposta d'estratègies de mitigació enfocades cap al tractament dels esmentats escenaris d'amenaça i (vi) la introducció d'una metodologia automatitzada de desplegament de diversos mecanismes de tolerància a fallades de forma robusta i sistemàtica. Els resultats de la present tesi constitueixen un conjunt de ferramentes i mètodes per ajudar el dissenyador de sistemes crítics en la seua tasca de desenvolupament de dissenys robustos, validats i a temps adaptats a la seua aplicació.Espinosa García, J. (2016). New Fault Detection, Mitigation and Injection Strategies for Current and Forthcoming Challenges of HW Embedded Designs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/73146TESISCompendi

    DeSyRe: on-Demand System Reliability

    No full text
    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

    The Challenge of Detection and Diagnosis of Fugacious Hardware Faults in VLSI Designs

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-38789-0_7Current integration scales are increasing the number and types of faults that embedded systems must face. Traditional approaches focus on dealing with those transient and permanent faults that impact the state or output of systems, whereas little research has targeted those faults being logically, electrically or temporally masked -which we have named fugacious. A fast detection and precise diagnosis of faults occurrence, even if the provided service is unaffected, could be of invaluable help to determine, for instance, that systems are currently under the influence of environmental disturbances like radiation, suffering from wear-out, or being affected by an intermittent fault. Upon detection, systems may react to adapt the deployed fault tolerance mechanisms to the diagnosed problem. This paper explores these ideas evaluating challenges and requirements involved, and provides an outline of potential techniques to be applied.This work has been funded by Spanish Ministry of Economy ARENES project (TIN2012-38308-C02-01)Espinosa García, J.; Andrés Martínez, DD.; Ruiz, JC.; Gil, P. (2013). The Challenge of Detection and Diagnosis of Fugacious Hardware Faults in VLSI Designs. En Dependable Computing. Springer. 76-87. https://doi.org/10.1007/978-3-642-38789-0_7S7687Narayanan, V., Xie, Y.: Reliability concerns in embedded systems design. IEEE Computer 1(39), 118–120 (2006)Hannius, O., Karlsson, J.: Impact of soft errors in a jet engine controller. In: Ortmeier, F., Daniel, P. (eds.) SAFECOMP 2012. LNCS, vol. 7612, pp. 223–234. Springer, Heidelberg (2012)Borkar, S.: Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro 25(6), 10–16 (2005)JEDEC: Measurement and reporting of alpha particle and terrestrial cosmic ray-induced soft errors in semiconductor devices. JEDEC Standard JESD89A. JEDEC (2006)Gracia-Moran, J., Gil-Tomas, D., Saiz-Adalid, L.J., Baraza, J.C., Gil-Vicente, P.J.: Experimental validation of a fault tolerant microcomputer system against intermittent faults. In: DSN, pp. 413–418 (2010)Constantinescu, C.: Intermittent faults and effects on reliability of integrated circuits. In: Proceedings of the 2008 Annual Reliability and Maintainability Symposium, pp. 370–374. IEEE Computer Society, Washington, DC (2008)Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secur. Comput. 1, 11–33 (2004)Johnson, C., Holloway, C.: The dangers of failure masking in fault-tolerant software: Aspects of a recent in-flight upset event. In: 2007 2nd Institution of Engineering and Technology International Conference on System Safety, pp. 60–65 (October 2007)Bolchini, C., Salice, F., Sciuto, D.: Fault analysis for networks with concurrent error detection. IEEE Des. Test 15(4), 66–74 (1998)Goessel, M., Ocheretny, V., Sogomonyan, E., Marienfeld, D.: New Methods of Concurrent Checking (Frontiers in Electronic Testing), 1st edn. Springer Publishing Company, Incorporated (2008)Iyer, R.K., Rossetti, D.J.: A statistical load dependency model for cpu errors at slac. In: Twenty-Fifth International Symposium on Fault-Tolerant Computing, ‘Highlights from Twenty-Five Years’, p. 373 (June 1995)Dodd, P.E., Shaneyfelt, M.R., Felix, J.A., Schwank, J.R.: Production and propagation of single-event transients in high-speed digital logic ics. IEEE Transactions on Nuclear Science 51, 3278–3284 (2004)Nightingale, E.B., Douceur, J.R., Orgovan, V.: Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer pcs. In: Proceedings of the Sixth Conference on Computer Systems, EuroSys 2011, pp. 343–356. ACM, New York (2011)Kimseng, K., Hoit, M., Tiwari, N., Pecht, M.: Physics-of-failure assessment of a cruise control module. Microelectronics Reliability 39(10), 1423–1444 (1999)Savir, J.: Detection of single intermittent faults in sequential circuits. IEEE Trans. Comput. 29(7), 673–678 (1980)Correcher, A., Garcia, E., Morant, F., Quiles, E., Rodriguez, L.: Intermittent failure dynamics characterization. IEEE Transactions on Reliability 61(3), 649–658 (2012)Sorensen, B., Kelly, G., Sajecki, A., Sorensen, P.: An analyzer for detecting intermittent faults in electronic devices. In: AUTOTESTCON 1994. IEEE Systems Readiness Technology Conference. ‘Cost Effective Support Into the Next Century’, Conference Proceedings, pp. 417–421 (September 1994)Sosnowski, J.: Transient fault tolerance in digital systems. IEEE Micro 14(1), 24–35 (1994)Bondavalli, A., Chiaradonna, S., Di Giandomenico, F., Grandoni, F.: Threshold-based mechanisms to discriminate transient from intermittent faults. IEEE Trans. Comput. 49(3), 230–245 (2000)Rashid, L., Pattabiraman, K., Gopalakrishnan, S.: Intermittent hardware errors and recovery: modelling and evaluation. In: International Conference on Quantitative Evaluation of Systems, QEST (2012)Touba, N.A., McCluskey, E.J.: Logic synthesis of multilevel circuits with concurrent error detection. IEEE Trans. CAD 16(7), 783–789 (1997)Nicolaidis, M., Manich, S., Figueras, J.: Achieving fault secureness in parity prediction arithmetic operators: General conditions and implementations. In: Proceedings of the 1996 European conference on Design and Test, EDTC 1996, pp. 186–193. IEEE Computer Society, Washington, DC (1996)Ko, S.B., Lo, J.C.: Efficient realization of parity prediction functions in fpgas. J. Electron. Test. 20(5), 489–499 (2004)D’Angelo, S., Sechi, G.R., Metra, C.: Transient and permanent fault diagnosis for fpga-based tmr systems. In: Proceedings of the 14th International Symposium on Defect and Fault-Tolerance in VLSI Systems, DFT 1999, pp. 330–338. IEEE Computer Society, Washington, DC (1999)Kim, C.: Detection and location of intermittent faults by monitoring carrier signal channel behavior of electrical interconnection system. In: Electric Ship Technologies Symposium, ESTS 2009, pp. 449–455. IEEE (April 2009

    The integration of on-line monitoring and reconfiguration functions using IEEE1149.4 into a safety critical automotive electronic control unit.

    Get PDF
    This paper presents an innovative application of IEEE 1149.4 and the integrated diagnostic reconfiguration (IDR) as tools for the implementation of an embedded test solution for an automotive electronic control unit, implemented as a fully integrated mixed signal system. The paper describes how the test architecture can be used for fault avoidance with results from a hardware prototype presented. The paper concludes that fault avoidance can be integrated into mixed signal electronic systems to handle key failure modes

    A Survey of Fault-Injection Methodologies for Soft Error Rate Modeling in Systems-on-Chips

    Get PDF
    The development of process technology has increased system performance, but the system failure probability has also significantly increased. It is important to consider the system reliability in addition to the cost, performance, and power consumption. In this paper, we describe the types of faults that occur in a system and where these faults originate. Then, fault-injection techniques, which are used to characterize the fault rate of a system-on-chip (SoC), are investigated to provide a guideline to SoC designers for the realization of resilient SoCs

    Lightweight Architectures for Reliable and Fault Detection Simon and Speck Cryptographic Algorithms on FPGA

    Get PDF
    The widespread use of sensitive and constrained applications necessitates lightweight (lowpower and low-area) algorithms developed for constrained nano-devices. However, nearly all of such algorithms are optimized for platform-based performance and may not be useful for diverse and flexible applications. The National Security Agency (NSA) has proposed two relatively-recent families of lightweight ciphers, i.e., Simon and Speck, designed as efficient ciphers on both hardware and software platforms. This paper proposes concurrent error detection schemes to provide reliable architectures for these two families of lightweight block ciphers. The research work on analyzing the reliability of these algorithms and providing fault diagnosis approaches has not been undertaken to date to the best of our knowledge. The main aim of the proposed reliable architectures is to provide high error coverage while maintaining acceptable area and power consumption overheads. To achieve this, we propose a variant of recomputing with encoded operands. These low-complexity schemes are suited for lowresource applications such as sensitive, constrained implantable and wearable medical devices. We perform fault simulations for the proposed architectures by developing a fault model framework. The architectures are simulated and analyzed on recent field-programmable grate array (FPGA) platforms, and it is shown that the proposed schemes provide high error coverage. The proposed low-complexity concurrent error detection schemes are a step forward towards more reliable architectures for Simon and Speck algorithms in lightweight, secure applications

    No Fault Found events in maintenance engineering Part 2: Root causes, technical developments and future research

    Get PDF
    This is the second half of a two paper series covering aspects of the no fault found (NFF) phenomenon, which is highly challenging and is becoming even more important due to increasing complexity and criticality of technical systems. Part 1 introduced the fundamental concept of unknown failures from an organizational, behavioral and cultural stand point. It also reported an industrial outlook to the problem, recent procedural standards, whilst discussing the financial implications and safety concerns. In this issue, the authors examine the technical aspects, reviewing the common causes of NFF failures in electronic, software and mechanical systems. This is followed by a survey on technological techniques actively being used to reduce the consequence of such instances. After discussing improvements in testability, the article identifies gaps in literature and points out the core areas that should be focused in the future. Special attention is paid to the recent trends on knowledge sharing and troubleshooting tools; with potential research on technical diagnosis being enumerated
    corecore