13 research outputs found

    Memory built-in self-repair and correction for improving yield: a review

    Get PDF
    Nanometer memories are highly prone to defects due to dense structure, necessitating memory built-in self-repair as a must-have feature to improve yield. Today’s system-on-chips contain memories occupying an area as high as 90% of the chip area. Shrinking technology uses stricter design rules for memories, making them more prone to manufacturing defects. Further, using 3D-stacked memories makes the system vulnerable to newer defects such as those coming from through-silicon-vias (TSV) and micro bumps. The increased memory size is also resulting in an increase in soft errors during system operation. Multiple memory repair techniques based on redundancy and correction codes have been presented to recover from such defects and prevent system failures. This paper reviews recently published memory repair methodologies, including various built-in self-repair (BISR) architectures, repair analysis algorithms, in-system repair, and soft repair handling using error correcting codes (ECC). It provides a classification of these techniques based on method and usage. Finally, it reviews evaluation methods used to determine the effectiveness of the repair algorithms. The paper aims to present a survey of these methodologies and prepare a platform for developing repair methods for upcoming-generation memories

    Architectures for dependable modern microprocessors

    Get PDF
    Η εξέλιξη των ολοκληρωμένων κυκλωμάτων σε συνδυασμό με τους αυστηρούς χρονικούς περιορισμούς καθιστούν την επαλήθευση της ορθής λειτουργίας των επεξεργαστών μία εξαιρετικά απαιτητική διαδικασία. Με κριτήριο το στάδιο του κύκλου ζωής ενός επεξεργαστή, από την στιγμή κατασκευής των πρωτοτύπων και έπειτα, οι τεχνικές ελέγχου ορθής λειτουργίας διακρίνονται στις ακόλουθες κατηγορίες: (1) Silicon Debug: Τα πρωτότυπα ολοκληρωμένα κυκλώματα ελέγχονται εξονυχιστικά, (2) Manufacturing Testing: ο τελικό ποιοτικός έλεγχος και (3) In-field verification: Περιλαμβάνει τεχνικές, οι οποίες διασφαλίζουν την λειτουργία του επεξεργαστή σύμφωνα με τις προδιαγραφές του. Η διδακτορική διατριβή προτείνει τα ακόλουθα: (1) Silicon Debug: Η εργασία αποσκοπεί στην επιτάχυνση της διαδικασίας ανίχνευσης σφαλμάτων και στον αυτόματο εντοπισμό τυχαίων προγραμμάτων που δεν περιέχουν νέα -χρήσιμη- πληροφορία σχετικά με την αίτια ενός σφάλματος. Η κεντρική ιδέα αυτής της μεθόδου έγκειται στην αξιοποίηση της έμφυτης ποικιλομορφίας των αρχιτεκτονικών συνόλου εντολών και στην δυνατότητα από-διαμόρφωσης τμημάτων του κυκλώματος, (2) Manufacturing Testing: προτείνεται μία μέθοδο για την βελτιστοποίηση του έλεγχου ορθής λειτουργίας των πολυνηματικών και πολυπύρηνων επεξεργαστών μέσω της χρήση λογισμικού αυτοδοκιμής, (3) Ιn-field verification: Αναλύθηκε σε βάθος η επίδραση που έχουν τα μόνιμα σφάλματα σε μηχανισμούς αύξησης της απόδοσης. Επιπρόσθετα, προτάθηκαν τεχνικές για την ανίχνευση και ανοχή μόνιμων σφαλμάτων υλικού σε μηχανισμούς πρόβλεψης διακλάδωσης.Technology scaling, extreme chip integration and the compelling requirement to diminish the time-to-market window, has rendered microprocessors more prone to design bugs and hardware faults. Microprocessor validation is grouped into the following categories, based on where they intervene in a microprocessor’s lifecycle: (a) Silicon debug: the first hardware prototypes are exhaustively validated, (b) Μanufacturing testing: the final quality control during massive production, and (c) In-field verification: runtime error detection techniques to guarantee correct operation. The contributions of this thesis are the following: (1) Silicon debug: We propose the employment of deconfigurable microprocessor architectures along with a technique to generate self-checking random test programs to avoid the simulation step and triage the redundant debug sessions, (2) Manufacturing testing: We propose a self-test optimization strategy for multithreaded, multicore microprocessors to speedup test program execution time and enhance the fault coverage of hard errors; and (3) In-field verification: We measure the effect of permanent faults performance components. Then, we propose a set of low-cost mechanisms for the detection, diagnosis and performance recovery in the front-end speculative structures. This thesis introduces various novel methodologies to address the validation challenges posed throughout the life-cycle of a chip

    Low-cost and efficient fault detection and diagnosis schemes for modern cores

    Get PDF
    Continuous improvements in transistor scaling together with microarchitectural advances have made possible the widespread adoption of high-performance processors across all market segments. However, the growing reliability threats induced by technology scaling and by the complexity of designs are challenging the production of cheap yet robust systems. Soft error trends are haunting, especially for combinational logic, and parity and ECC codes are therefore becoming insufficient as combinational logic turns into the dominant source of soft errors. Furthermore, experts are warning about the need to also address intermittent and permanent faults during processor runtime, as increasing temperatures and device variations will accelerate inherent aging phenomena. These challenges specially threaten the commodity segments, which impose requirements that existing fault tolerance mechanisms cannot offer. Current techniques based on redundant execution were devised in a time when high penalties were assumed for the sake of high reliability levels. Novel light-weight techniques are therefore needed to enable fault protection in the mass market segments. The complexity of designs is making post-silicon validation extremely expensive. Validation costs exceed design costs, and the number of discovered bugs is growing, both during validation and once products hit the market. Fault localization and diagnosis are the biggest bottlenecks, magnified by huge detection latencies, limited internal observability, and costly server farms to generate test outputs. This thesis explores two directions to address some of the critical challenges introduced by unreliable technologies and by the limitations of current validation approaches. We first explore mechanisms for comprehensively detecting multiple sources of failures in modern processors during their lifetime (including transient, intermittent, permanent and also design bugs). Our solutions embrace a paradigm where fault tolerance is built based on exploiting high-level microarchitectural invariants that are reusable across designs, rather than relying on re-execution or ad-hoc block-level protection. To do so, we decompose the basic functionalities of processors into high-level tasks and propose three novel runtime verification solutions that combined enable global error detection: a computation/register dataflow checker, a memory dataflow checker, and a control flow checker. The techniques use the concept of end-to-end signatures and allow designers to adjust the fault coverage to their needs, by trading-off area, power and performance. Our fault injection studies reveal that our methods provide high coverage levels while causing significantly lower performance, power and area costs than existing techniques. Then, this thesis extends the applicability of the proposed error detection schemes to the validation phases. We present a fault localization and diagnosis solution for the memory dataflow by combining our error detection mechanism, a new low-cost logging mechanism and a diagnosis program. Selected internal activity is continuously traced and kept in a memory-resident log whose capacity can be expanded to suite validation needs. The solution can catch undiscovered bugs, reducing the dependence on simulation farms that compute golden outputs. Upon error detection, the diagnosis algorithm analyzes the log to automatically locate the bug, and also to determine its root cause. Our evaluations show that very high localization coverage and diagnosis accuracy can be obtained at very low performance and area costs. The net result is a simplification of current debugging practices, which are extremely manual, time consuming and cumbersome. Altogether, the integrated solutions proposed in this thesis capacitate the industry to deliver more reliable and correct processors as technology evolves into more complex designs and more vulnerable transistors.El continuo escalado de los transistores junto con los avances microarquitectónicos han posibilitado la presencia de potentes procesadores en todos los segmentos de mercado. Sin embargo, varios problemas de fiabilidad están desafiando la producción de sistemas robustos. Las predicciones de "soft errors" son inquietantes, especialmente para la lógica combinacional: soluciones como ECC o paridad se están volviendo insuficientes a medida que dicha lógica se convierte en la fuente predominante de soft errors. Además, los expertos están alertando acerca de la necesidad de detectar otras fuentes de fallos (causantes de errores permanentes e intermitentes) durante el tiempo de vida de los procesadores. Los segmentos "commodity" son los más vulnerables, ya que imponen unos requisitos que las técnicas actuales de fiabilidad no ofrecen. Estas soluciones (generalmente basadas en re-ejecución) fueron ideadas en un tiempo en el que con tal de alcanzar altos nivel de fiabilidad se asumían grandes costes. Son por tanto necesarias nuevas técnicas que permitan la protección contra fallos en los segmentos más populares. La complejidad de los diseños está encareciendo la validación "post-silicon". Su coste excede el de diseño, y el número de errores descubiertos está aumentando durante la validación y ya en manos de los clientes. La localización y el diagnóstico de errores son los mayores problemas, empeorados por las altas latencias en la manifestación de errores, por la poca observabilidad interna y por el coste de generar las señales esperadas. Esta tesis explora dos direcciones para tratar algunos de los retos causados por la creciente vulnerabilidad hardware y por las limitaciones de los enfoques de validación. Primero exploramos mecanismos para detectar múltiples fuentes de fallos durante el tiempo de vida de los procesadores (errores transitorios, intermitentes, permanentes y de diseño). Nuestras soluciones son de un paradigma donde la fiabilidad se construye explotando invariantes microarquitectónicos genéricos, en lugar de basarse en re-ejecución o en protección ad-hoc. Para ello descomponemos las funcionalidades básicas de un procesador y proponemos tres soluciones de `runtime verification' que combinadas permiten una detección de errores a nivel global. Estas tres soluciones son: un verificador de flujo de datos de registro y de computación, un verificador de flujo de datos de memoria y un verificador de flujo de control. Nuestras técnicas usan el concepto de firmas y permiten a los diseñadores ajustar los niveles de protección a sus necesidades, mediante compensaciones en área, consumo energético y rendimiento. Nuestros estudios de inyección de errores revelan que los métodos propuestos obtienen altos niveles de protección, a la vez que causan menos costes que las soluciones existentes. A continuación, esta tesis explora la aplicabilidad de estos esquemas a las fases de validación. Proponemos una solución de localización y diagnóstico de errores para el flujo de datos de memoria que combina nuestro mecanismo de detección de errores, junto con un mecanismo de logging de bajo coste y un programa de diagnóstico. Cierta actividad interna es continuamente registrada en una zona de memoria cuya capacidad puede ser expandida para satisfacer las necesidades de validación. La solución permite descubrir bugs, reduciendo la necesidad de calcular los resultados esperados. Al detectar un error, el algoritmo de diagnóstico analiza el registro para automáticamente localizar el bug y determinar su causa. Nuestros estudios muestran un alto grado de localización y de precisión de diagnóstico a un coste muy bajo de rendimiento y área. El resultado es una simplificación de las prácticas actuales de depuración, que son enormemente manuales, incómodas y largas. En conjunto, las soluciones de esta tesis capacitan a la industria a producir procesadores más fiables, a medida que la tecnología evoluciona hacia diseños más complejos y más vulnerables

    Contract Testing for Reliable Embedded Systems

    Get PDF
    Embedded systems comprise diverse technologies complicating their design. By creating virtual prototypes of the target system, Electronic System Level Design, the early analysis of a system composed by electronics and software is possible. However, the concrete interaction between hardware modules and between hardware and software is left for late development stages and real prototype making. Generally, interaction between components is assumed to be correct. However, it has to be assumed on development implicitly because interaction between components is not considered in the functionality design. While single components are mostly thoroughly tested and guarantee certain reliability levels, their interaction is based on often underspecified interfaces. Although component usage is mostly specified, operational constraints are often left out. Finally, not only the interaction between components but also with the environment and the user are not ensured. Generally, only functional integration tests are executed and corner-cases are left out, leaving uncovered faults that only manifest as failures later when their cost is higher. Therefore, this work aims at component interaction through specification of interfaces, test generation and real-time test execution. The specification is based on the design-by-contract approach of software that specifies semantics of component interaction in addition to the syntactical definition through functions. In the first part of this work, a specification for the interaction between hardware modules is given. With the automatic real-time test execution, fulfillment of specified preconditions for correct component operation can be checked. In component-based design, the component is trusted and thus, its functionality is assumed to be correct when certain postconditions are specified. In a correct component assembly, component postconditions fulfill preconditions of other components resulting in an operational system. The specification of preconditions follows the definition of environmental properties, acceptable input sequences for interfacing pins, as well as acceptable signal parameters, such as voltage levels, slope times, delays and glitches. Postconditions are defined by the description of a functionality accompanying constraints, such as timing. These parameters are automatically determined on operation by a testing circuit. Parameters that violate the specification are signaled by the testing circuit and failure is detected. The chosen parameters can give hint of the reason for the failure being an evidence of a circuit fault. In the example of an Inter-Integrated Circuit (I2C) communication system, we define contracts and show comparisons between contract violation, fault categorization and failure occurrence under signal fault injection. To complete this work, support for fault analysis on the electronic system level design is given. For this, the data transfers between the high-level models used in the design are augmented with the defined contract parameters. With a specific interface, digital faults are generated for transactions with violating signal parameters that can be tracked by the system. This way, recovery mechanisms for synchronous communication are proposed and tested. In the second part, the interaction between hardware and software is tackled providing special methods for developing device drivers. For this, we do not only specify the interface between hardware and software but also map the hardware control elements to software, partially generating the software interface for a device. This is necessary because drivers handle devices with internal control elements like registers, data streams and interrupts that cannot be represented on software. This systematic composition of drivers facilitates the development of a device interface called the device mechanism. It is the lowest layer of a two-layer architecture for driver development. The device mechanism carries out the access to the device exporting a pure software interface. This interface is based on the device implementation being, thus, fully specified. Further data processing required for compliance with the operating system or application is carried out in the driver policy, the layer on top of it. With the definition of a software layer for device control, contracts specifying constraints of this interface are proposed. These contracts are based on implementation constraints of the device and on its dynamic behavior. Therefore, an extended finite state machine models the dynamic behavior of the device. Based on it, functions of the device mechanism can be augmented with preconditions on the state or on state machine variables. These conditions are then checked on runtime. After execution of a function, its postconditions are ensured, such as timing. This guarantees that different driver policies, operating systems or firmwares, use this same device mechanism fulfilling its constraints. On the example of a Philips webcam, we develop the complete driver for Linux based on our architecture, creating contracts for its device mechanism. Following the systematic composition and the contract approach, driver bugs are avoided that otherwise violate allowed values for device data and execution orders of device protocols

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Investigation of radiation-hardened design of electronic systems with applications to post-accident monitoring for nuclear power plants

    Get PDF
    This research aims at improving the robustness of electronic systems used-in high level radiation environments by combining with radiation-hardened (rad-hardened) design and fault-tolerant techniques based on commercial off-the-shelf (COTS) components. A specific of the research is to use such systems for wireless post-accident monitoring in nuclear power plants (NPPs). More specifically, the following methods and systems are developed and investigated to accomplish expected research objectives: analysis of radiation responses, design of a radiation-tolerant system, implementation of a wireless post-accident monitoring system for NPPs, performance evaluation without repeat physical tests, and experimental validation in a radiation environment. A method is developed to analyze ionizing radiation responses of COTS-based devices and circuits in various radiation conditions, which can be applied to design circuits robust to ionizing radiation effects without repeated destructive tests in a physical radiation environment. Some mathematical models of semiconductor devices for post-irradiation conditions are investigated, and their radiation responses are analyzed using Technology Computer Aided Design (TCAD) simulator. Those models are then used in the analysis of circuits and systems under radiation condition. Based on the simulation results, method of rapid power off may be effectively to protect electronic systems under ionizing radiation. It can be a potential solution to mitigate damages of electronic components caused by radiation. With simulation studies of photocurrent responses of semiconductor devices, two methods are presented to mitigate the damages of total ionizing dose: component selection and radiation shielding protection. According to the investigation of radiation-tolerance of regular COTS components, most COTS-based semiconductor components may experience performance degradation and radiation damages when the total dose is greater than 20 K Rad (Si). A principle of component selection is given to obtain the suitable components, as well as a method is proposed to assess the component reliability under radiation environments, which uses radiation degradation factors, instead of the usual failure rate data in the reliability model. Radiation degradation factor is as the input to describe the radiation response of a component under a total radiation dose. In addition, a number of typical semiconductor components are also selected as the candidate components for the application of wireless monitoring in nuclear power plants. On the other hand, a multi-layer shielding protection is used to reduce the total dose to be less than 20 K Rad (Si) for a given radiation condition; the selected semiconductor devices can then survive in the radiation condition with the reduced total dose. The calculation method of required shielding thickness is also proposed to achieve the design objectives. Several shielding solutions are also developed and compared for applications in wireless monitoring system in nuclear power plants. A radiation-tolerant architecture is proposed to allow COTS-based electronic systems to be used in high-level radiation environments without using rad-hardened components. Regular COTS components are used with some fault-tolerant techniques to mitigate damages of the system through redundancy, online fault detection, real-time preventive remedial actions, and rapid power off. The functions of measurement, processing, communication, and fault-tolerance are integrated locally within all channels without additional detection units. A hardware emulation bench with redundant channels is constructed to verify the effectiveness of the developed radiation-tolerant architecture. Experimental results have shown that the developed architecture works effectively and redundant channels can switch smoothly in 500 milliseconds or less when a single fault or multiple faults occur. An online mechanism is also investigated to timely detect and diagnose radiation damages in the developed redundant architecture for its radiation tolerance enhancement. This is implemented by the built-in-test technique. A number of tests by using fault injection techniques have been carried out in the developed hardware emulation bench to validate the proposed detection mechanism. The test results have shown that faults and errors can be effectively detected and diagnosed. For the developed redundant wireless devices under given radiation dose (20 K Rad (Si)), the fault detection coverage is about 62.11%. This level of protection could be improved further by putting more resources (CPU consumption, etc.) into the function of fault detection, but the cost will increase. To apply the above investigated techniques and systems, under a severe accident condition in a nuclear power plant, a prototype of wireless post-accident monitoring system (WPAMS) is designed and constructed. Specifically, the radiation-tolerant wireless device is implemented with redundant and diversified channels. The developed system operates effectively to measure up-to-date information from a specific area/process and to transmit that information to remote monitoring station wirelessly. Hence, the correctness of the proposed architecture and approaches in this research has been successfully validated. In the design phase, an assessment method without performing repeated destructive physical tests is investigated to evaluate the radiation-tolerance of electronic systems by combining the evaluation of radiation protection and the analysis of the system reliability under the given radiation conditions. The results of the assessment studies have shown that, under given radiation conditions, the reliability of the developed radiation-tolerant wireless system can be much higher than those of non-redundant channels; and it can work in high-level radiation environments with total dose up to 1 M Rad (Si). Finally, a number of total dose tests are performed to investigate radiation effects induced by gamma radiation on distinct modern wireless monitoring devices. An experimental setup is developed to monitor the performance of signal measurement online and transmission of the developed distinct wireless electronic devices directly under gamma radiator at The Ohio State University Nuclear Reactor Lab (OSU-NRL). The gamma irradiator generates dose rates of 20 K Rad/h and 200 Rad/h on the samples, respectively. It was found that both measurement and transmission functions of distinct wireless measurement and transmission devices work well under gamma radiation conditions before the devices permanently damage. The experimental results have also shown that the developed radiation-tolerant design can be applied to effectively extend the lifespan of COTS-based electronic systems in the high-level radiation environment, as well as to improve the performance of wireless communication systems. According to testing results, the developed radiation-tolerant wireless device with a shielding protection can work at least 21 hours under the highest dose rate (20 K Rad/h). In summary, this research has addressed important issues on the design of radiation-tolerant systems without using rad-hardened electronic components. The proposed methods and systems provide an effective and economical solution to implement monitoring systems for obtaining up-to-date information in high-level radiation environments. The reported contributions are of significance both academically and in practice

    Methoden und Beschreibungssprachen zur Modellierung und Verifikation vonSchaltungen und Systemen: MBMV 2015 - Tagungsband, Chemnitz, 03. - 04. März 2015

    Get PDF
    Der Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015) findet nun schon zum 18. mal statt. Ausrichter sind in diesem Jahr die Professur Schaltkreis- und Systementwurf der Technischen Universität Chemnitz und das Steinbeis-Forschungszentrum Systementwurf und Test. Der Workshop hat es sich zum Ziel gesetzt, neueste Trends, Ergebnisse und aktuelle Probleme auf dem Gebiet der Methoden zur Modellierung und Verifikation sowie der Beschreibungssprachen digitaler, analoger und Mixed-Signal-Schaltungen zu diskutieren. Er soll somit ein Forum zum Ideenaustausch sein. Weiterhin bietet der Workshop eine Plattform für den Austausch zwischen Forschung und Industrie sowie zur Pflege bestehender und zur Knüpfung neuer Kontakte. Jungen Wissenschaftlern erlaubt er, ihre Ideen und Ansätze einem breiten Publikum aus Wissenschaft und Wirtschaft zu präsentieren und im Rahmen der Veranstaltung auch fundiert zu diskutieren. Sein langjähriges Bestehen hat ihn zu einer festen Größe in vielen Veranstaltungskalendern gemacht. Traditionell sind auch die Treffen der ITGFachgruppen an den Workshop angegliedert. In diesem Jahr nutzen zwei im Rahmen der InnoProfile-Transfer-Initiative durch das Bundesministerium für Bildung und Forschung geförderte Projekte den Workshop, um in zwei eigenen Tracks ihre Forschungsergebnisse einem breiten Publikum zu präsentieren. Vertreter der Projekte Generische Plattform für Systemzuverlässigkeit und Verifikation (GPZV) und GINKO - Generische Infrastruktur zur nahtlosen energetischen Kopplung von Elektrofahrzeugen stellen Teile ihrer gegenwärtigen Arbeiten vor. Dies bereichert denWorkshop durch zusätzliche Themenschwerpunkte und bietet eine wertvolle Ergänzung zu den Beiträgen der Autoren. [... aus dem Vorwort

    A reusable BIST with software assisted repair technology for improved memory and IO debug, validation and test time

    No full text
    corecore