22 research outputs found

    New techniques for functional testing of microprocessor based systems

    Get PDF
    Electronic devices may be affected by failures, for example due to physical defects. These defects may be introduced during the manufacturing process, as well as during the normal operating life of the device due to aging. How to detect all these defects is not a trivial task, especially in complex systems such as processor cores. Nevertheless, safety-critical applications do not tolerate failures, this is the reason why testing such devices is needed so to guarantee a correct behavior at any time. Moreover, testing is a key parameter for assessing the quality of a manufactured product. Consolidated testing techniques are based on special Design for Testability (DfT) features added in the original design to facilitate test effectiveness. Design, integration, and usage of the available DfT for testing purposes are fully supported by commercial EDA tools, hence approaches based on DfT are the standard solutions adopted by silicon vendors for testing their devices. Tests exploiting the available DfT such as scan-chains manipulate the internal state of the system, differently to the normal functional mode, passing through unreachable configurations. Alternative solutions that do not violate such functional mode are defined as functional tests. In microprocessor based systems, functional testing techniques include software-based self-test (SBST), i.e., a piece of software (referred to as test program) which is uploaded in the system available memory and executed, with the purpose of exciting a specific part of the system and observing the effects of possible defects affecting it. SBST has been widely-studies by the research community for years, but its adoption by the industry is quite recent. My research activities have been mainly focused on the industrial perspective of SBST. The problem of providing an effective development flow and guidelines for integrating SBST in the available operating systems have been tackled and results have been provided on microprocessor based systems for the automotive domain. Remarkably, new algorithms have been also introduced with respect to state-of-the-art approaches, which can be systematically implemented to enrich SBST suites of test programs for modern microprocessor based systems. The proposed development flow and algorithms are being currently employed in real electronic control units for automotive products. Moreover, a special hardware infrastructure purposely embedded in modern devices for interconnecting the numerous on-board instruments has been interest of my research as well. This solution is known as reconfigurable scan networks (RSNs) and its practical adoption is growing fast as new standards have been created. Test and diagnosis methodologies have been proposed targeting specific RSN features, aimed at checking whether the reconfigurability of such networks has not been corrupted by defects and, in this case, at identifying the defective elements of the network. The contribution of my work in this field has also been included in the first suite of public-domain benchmark networks

    On the functional test of the BTB logic in pipelined and superscalar processors

    Get PDF
    Electronic systems are increasingly used for safety-critical applications, where the effects of faults must be taken under control and hopefully avoided. For this purpose, test of manufactured devices is particularly important, both at the end of the production line and during the operational phase. This paper describes a method to test the logic implementing the Branch Prediction Unit in pipelined and superscalar processors when this follows the Branch Target Buffer (BTB) architecture; the proposed approach is functional, i.e., it is based on forcing the processor to execute a suitably devised test program and observing the produced results. Experimental results are provided on the DLX processor, showing that the method can achieve a high value of stuck-at fault coverage while also testing the memory in the BT

    Observation mechanisms for in-field software-based self-test

    Get PDF
    When electronic systems are used in safety critical applications, as in the space, avionic, automotive or biomedical areas, it is required to maintain a very low probability of failures due to faults of any kind. Standards and regulations play a significant role, forcing companies to devise and adopt solutions able to achieve predefined targets in terms of dependability. Different techniques can be used to reduce fault occurrence or to minimize the probability that those faults produce critical failures (e.g., by introducing redundancy). Unfortunately, most of these techniques have a severe impact on the cost of the resulting product and, in some cases, the probability of failures is too large anyway. Hence, a solution commonly used in several scenarios lies on periodically performing a test able to detect the occurrence of any fault before it produces a failure (in-field test). This solution is normally based on forcing the processor inside the Device Under Test to execute a properly written test program, which is able to activate possible faults and to make their effects visible in some observable locations. This approach is also called Software-Based Self-Test, or SBST. If compared with testing in an end of manufacturing scenario, in-field testing has strong limitations in terms of access to the system inputs and outputs because Design for Testability structures and testing equipment are usually not available. As a consequence there are reduced possibilities to activate the faults and to observe their effects. This reduced observability particularly affects the ability to detect performance faults, i.e. faults that modify the timing but not the final value of computations. This kind of faults are hard to detect by only observing the final content of predefined memory locations, that is the usual test result observation method used in-field. Initially, the present work was focused on fault tolerance techniques against transient faults induced by ionizing radiation, the so called Single Event Upsets (SEUs). The main contribution of this early stage of the thesis lies in the experimental validation of the feasibility of achieving a safe system by using an architecture that combines task-level redundancy with already available IP cores, thus minimizing the development time. Task execution is replicated and Memory Protection is used to guarantee that any SEU may affect one and only one of the replicas. A proof of concept implementation was developed and validated using fault injection. Results outline the effectiveness of the architecture, and the overhead analysis shows that the proposed architecture is effective in reducing the resource occupation with respect to N-modular redundancy, at an affordable cost in terms of application execution time. The main part of the thesis is focused on in-field software-based self-test of permanent faults. A set of observation methods exploiting existing or ad-hoc hardware is proposed, aimed at obtaining a better coverage, in particular of performance faults. An extensive quantitative evaluation of the proposed methods is presented, including a comparison with the observation methods traditionally used in end of manufacturing and in-field testing. Results show that the proposed methods are a good complement to the traditionally used final memory content observation. Moreover, they show that an adequate combination of these complementary methods allows for achieving nearly the same fault coverage achieved when continuously observing all the processor outputs, which is an observation method commonly used for production test but usually not available in-field. A very interesting by-product of what is described above is a detailed description of how to compute the fault coverage achieved by functional in-field tests using a conventional fault simulator, a tool that is usually applied in an end of manufacturing testing scenario. Finally, another relevant result in the testing area is a method to detect permanent faults inside the cache coherence logic integrated in each cache controller of a multi-core system, based on the concurrent execution of a test program by the different cores in a coordinated manner. By construction, the method achieves full fault coverage of the static faults in the addressed logic.Cuando se utilizan sistemas electrónicos en aplicaciones críticas como en las áreas biomédica, aeroespacial o automotriz, se requiere mantener una muy baja probabilidad de malfuncionamientos debidos a cualquier tipo de fallas. Los estándares y normas juegan un papel importante, forzando a los desarrolladores a diseñar y adoptar soluciones que sean capaces de alcanzar objetivos predefinidos en cuanto a seguridad y confiabilidad. Pueden utilizarse diferentes técnicas para reducir la ocurrencia de fallas o para minimizar la probabilidad de que esas fallas produzcan mal funcionamientos críticos, por ejemplo a través de la incorporación de redundancia. Lamentablemente, muchas de esas técnicas afectan en gran medida el costo de los productos y, en algunos casos, la probabilidad de malfuncionamiento sigue siendo demasiado alta. En consecuencia, una solución usada a menudo en varios escenarios consiste en realizar periódicamente un test que sea capaz de detectar la ocurrencia de una falla antes de que esta produzca un mal funcionamiento (test en campo). En general, esta solución se basa en forzar a un procesador existente dentro del dispositivo bajo prueba a ejecutar un programa de test que sea capaz de activar las posibles fallas y de hacer que sus efectos sean visibles en puntos observables. A esta metodología también se la llama auto-test basado en software, o en inglés Software-Based Self-Test (SBST). Si se lo compara con un escenario de test de fin de fabricación, el test en campo tiene fuertes limitaciones en términos de posibilidad de acceso a las entradas y salidas del sistema, porque usualmente no se dispone de equipamiento de test ni de la infraestructura de Design for Testability. En consecuencia se tiene menos posibilidades de activar las fallas y de observar sus efectos. Esta observabilidad reducida afecta particularmente la habilidad para detectar fallas de performance, es decir fallas que modifican la temporización pero no el resultado final de los cálculos. Este tipo de fallas es difícil de detectar por la sola observación del contenido final de lugares de memoria, que es el método usual que se utiliza para observar los resultados de un test en campo. Inicialmente, el presente trabajo estuvo enfocado en técnicas para tolerar fallas transitorias inducidas por radiación ionizante, llamadas en inglés Single Event Upsets (SEUs). La principal contribución de esa etapa inicial de la tesis reside en la validación experimental de la viabilidad de obtener un sistema seguro, utilizando una arquitectura que combina redundancia a nivel de tareas con el uso de módulos hardware (IP cores) ya disponibles, que minimiza en consecuencia el tiempo de desarrollo. Se replica la ejecución de las tareas y se utiliza protección de memoria para garantizar que un SEU pueda afectar a lo sumo a una sola de las réplicas. Se desarrolló una implementación para prueba de concepto que fue validada mediante inyección de fallas. Los resultados muestran la efectividad de la arquitectura, y el análisis de los recursos utilizados muestra que la arquitectura propuesta es efectiva en reducir la ocupación con respecto a la redundancia modular con N réplicas, a un costo accesible en términos de tiempo de ejecución. La parte principal de esta tesis se enfoca en el área de auto-test en campo basado en software para la detección de fallas permanentes. Se propone un conjunto de métodos de observación utilizando hardware existente o ad-hoc, con el fin de obtener una mejor cobertura, en particular de las fallas de performance. Se presenta una extensa evaluación cuantitativa de los métodos propuestos, que incluye una comparación con los métodos tradicionalmente utilizados en tests de fin de fabricación y en campo. Los resultados muestran que los métodos propuestos son un buen complemento del método tradicionalmente usado que consiste en observar el valor final del contenido de memoria. Además muestran que una adecuada combinación de estos métodos complementarios permite alcanzar casi los mismos valores de cobertura de fallas que se obtienen mediante la observación continua de todas las salidas del procesador, método comúnmente usado en tests de fin de fabricación, pero que usualmente no está disponible en campo. Un subproducto muy interesante de lo arriba expuesto es la descripción detallada del procedimiento para calcular la cobertura de fallas lograda mediante tests funcionales en campo por medio de un simulador de fallas convencional, una herramienta que usualmente se aplica en escenarios de test de fin de fabricación. Finalmente, otro resultado relevante en el área de test es un método para detectar fallas permanentes dentro de la lógica de coherencia de cache que está integrada en el controlador de cache de cada procesador en un sistema multi procesador. El método está basado en la ejecución de un programa de test en forma coordinada por parte de los diferentes procesadores. Por construcción, el método cubre completamente las fallas de la lógica mencionad

    Architectures for dependable modern microprocessors

    Get PDF
    Η εξέλιξη των ολοκληρωμένων κυκλωμάτων σε συνδυασμό με τους αυστηρούς χρονικούς περιορισμούς καθιστούν την επαλήθευση της ορθής λειτουργίας των επεξεργαστών μία εξαιρετικά απαιτητική διαδικασία. Με κριτήριο το στάδιο του κύκλου ζωής ενός επεξεργαστή, από την στιγμή κατασκευής των πρωτοτύπων και έπειτα, οι τεχνικές ελέγχου ορθής λειτουργίας διακρίνονται στις ακόλουθες κατηγορίες: (1) Silicon Debug: Τα πρωτότυπα ολοκληρωμένα κυκλώματα ελέγχονται εξονυχιστικά, (2) Manufacturing Testing: ο τελικό ποιοτικός έλεγχος και (3) In-field verification: Περιλαμβάνει τεχνικές, οι οποίες διασφαλίζουν την λειτουργία του επεξεργαστή σύμφωνα με τις προδιαγραφές του. Η διδακτορική διατριβή προτείνει τα ακόλουθα: (1) Silicon Debug: Η εργασία αποσκοπεί στην επιτάχυνση της διαδικασίας ανίχνευσης σφαλμάτων και στον αυτόματο εντοπισμό τυχαίων προγραμμάτων που δεν περιέχουν νέα -χρήσιμη- πληροφορία σχετικά με την αίτια ενός σφάλματος. Η κεντρική ιδέα αυτής της μεθόδου έγκειται στην αξιοποίηση της έμφυτης ποικιλομορφίας των αρχιτεκτονικών συνόλου εντολών και στην δυνατότητα από-διαμόρφωσης τμημάτων του κυκλώματος, (2) Manufacturing Testing: προτείνεται μία μέθοδο για την βελτιστοποίηση του έλεγχου ορθής λειτουργίας των πολυνηματικών και πολυπύρηνων επεξεργαστών μέσω της χρήση λογισμικού αυτοδοκιμής, (3) Ιn-field verification: Αναλύθηκε σε βάθος η επίδραση που έχουν τα μόνιμα σφάλματα σε μηχανισμούς αύξησης της απόδοσης. Επιπρόσθετα, προτάθηκαν τεχνικές για την ανίχνευση και ανοχή μόνιμων σφαλμάτων υλικού σε μηχανισμούς πρόβλεψης διακλάδωσης.Technology scaling, extreme chip integration and the compelling requirement to diminish the time-to-market window, has rendered microprocessors more prone to design bugs and hardware faults. Microprocessor validation is grouped into the following categories, based on where they intervene in a microprocessor’s lifecycle: (a) Silicon debug: the first hardware prototypes are exhaustively validated, (b) Μanufacturing testing: the final quality control during massive production, and (c) In-field verification: runtime error detection techniques to guarantee correct operation. The contributions of this thesis are the following: (1) Silicon debug: We propose the employment of deconfigurable microprocessor architectures along with a technique to generate self-checking random test programs to avoid the simulation step and triage the redundant debug sessions, (2) Manufacturing testing: We propose a self-test optimization strategy for multithreaded, multicore microprocessors to speedup test program execution time and enhance the fault coverage of hard errors; and (3) In-field verification: We measure the effect of permanent faults performance components. Then, we propose a set of low-cost mechanisms for the detection, diagnosis and performance recovery in the front-end speculative structures. This thesis introduces various novel methodologies to address the validation challenges posed throughout the life-cycle of a chip

    On the On-line Functional Test of the Reorder Buffer Memory in Superscalar Processors

    Get PDF
    The Reorder Buffer (ROB) is a key component in superscalar processors. It enables both in-order commitment of instructions and precise exception management even in those architectures that support out-of-order execution. The ROB architecture typically includes a memory array whose size may reach several thousands of bits. Testing this array may be important to guarantee the correct behavior of the processor. Proprietary BIST solutions typically adopted by manufacturers for end-of-production test are not always suitable for on-line test. In fact, they require the usage of test infrastructures that may be expensive, or may not be accessible and/or documented. This paper proposes an alternative solution, based on a functional approach, which has been validated resorting to both an architectural and a memory fault simulato

    Fault Detection Methodology for Caches in Reliable Modern VLSI Microprocessors based on Instruction Set Architectures

    Get PDF
    Η παρούσα διδακτορική διατριβή εισάγει μία χαμηλού κόστους μεθοδολογία για την ανίχνευση ελαττωμάτων σε μικρές ενσωματωμένες κρυφές μνήμες που βασίζεται σε σύγχρονες Αρχιτεκτονικές Συνόλου Εντολών και εφαρμόζεται με λογισμικό αυτοδοκιμής. Η προτεινόμενη μεθοδολογία εφαρμόζει αλγορίθμους March μέσω λογισμικού για την ανίχνευση τόσο ελαττωμάτων αποθήκευσης όταν εφαρμόζεται σε κρυφές μνήμες που περιέχουν μόνο στατικές μνήμες τυχαίας προσπέλασης όπως για παράδειγμα κρυφές μνήμες επιπέδου 1, όσο και ελαττωμάτων σύγκρισης όταν εφαρμόζεται σε κρυφές μνήμες που περιέχουν εκτός από SRAM μνήμες και μνήμες διευθυνσιοδοτούμενες μέσω περιεχομένου, όπως για παράδειγμα πλήρως συσχετιστικές κρυφές μνήμες αναζήτησης μετάφρασης. Η προτεινόμενη μεθοδολογία εφαρμόζεται και στις τρεις οργανώσεις συσχετιστικότητας κρυφής μνήμης και είναι ανεξάρτητη της πολιτικής εγγραφής στο επόμενο επίπεδο της ιεραρχίας. Η μεθοδολογία αξιοποιεί υπάρχοντες ισχυρούς μηχανισμούς των μοντέρνων ISAs χρησιμοποιώντας ειδικές εντολές, που ονομάζονται στην παρούσα διατριβή Εντολές Άμεσης Προσπέλασης Κρυφής Μνήμης (Direct Cache Access Instructions - DCAs). Επιπλέον, η προτεινόμενη μεθοδολογία εκμεταλλεύεται τους έμφυτους μηχανισμούς καταγραφής απόδοσης και τους μηχανισμούς χειρισμού παγίδων που είναι διαθέσιμοι στους σύγχρονους επεξεργαστές. Επιπρόσθετα, η προτεινόμενη μεθοδολογία εφαρμόζει την λειτουργία σύγκρισης των αλγορίθμων March όταν αυτή απαιτείται (για μνήμες CAM) και επαληθεύει το αποτέλεσμα του ελέγχου μέσω σύντομης απόκρισης, ώστε να είναι συμβατή με τις απαιτήσεις του ελέγχου εντός λειτουργίας. Τέλος, στη διατριβή προτείνεται μία βελτιστοποίηση της μεθοδολογίας για πολυνηματικές, πολυπύρηνες αρχιτεκτονικές.The present PhD thesis introduces a low cost fault detection methodology for small embedded cache memories that is based on modern Instruction Set Architectures and is applied with Software-Based Self-Test (SBST) routines. The proposed methodology applies March tests through software to detect both storage faults when applied to caches that comprise Static Random Access Memories (SRAM) only, e.g. L1 caches, and comparison faults when applied to caches that apart from SRAM memories comprise Content Addressable Memories (CAM) too, e.g. Translation Lookaside Buffers (TLBs). The proposed methodology can be applied to all three cache associativity organizations: direct mapped, set-associative and full-associative and it does not depend on the cache write policy. The methodology leverages existing powerful mechanisms of modern ISAs by utilizing instructions that we call in this PhD thesis Direct Cache Access (DCA) instructions. Moreover, our methodology exploits the native performance monitoring hardware and the trap handling mechanisms which are available in modern microprocessors. Moreover, the proposed Methodology applies March compare operations when needed (for CAM arrays) and verifies the test result with a compact response to comply with periodic on-line testing needs. Finally, a multithreaded optimization of the proposed methodology that targets multithreaded, multicore architectures is also presented in this thesi

    A Functional Approach for Testing the Reorder Buffer Memory

    Get PDF
    Superscalar processors may have the ability to execute instructions out-of-order to better exploit the internal hardware and to maximize the performance. To maintain the in-order instructions commitment and to guarantee the correctness of the final results (as well as precise exception management), the Reorder Buffer (ROB) is used. From the architectural point of view, the ROB is a memory array of several thousands of bits that must be tested against hardware faults to ensure a correct behavior of the processor. Since it is deeply embedded within the microprocessor circuitry, the most straightforward approach to test the ROB is through Built-In Self-Test solutions, which are typically adopted by manufacturers for end-of-production test. However, these solutions may not always be used for the test during the operational phase (in-field test) which aims at detecting possible hardware faults arising when the electronic systems works in its target environment. In fact, these solutions require the usage of test infrastructures that may not be accessible and/or documented, or simply not usable during the operational phase. This paper proposes an alternative solution, based on a functional approach, in which the test is performed by forcing the processor to execute a specially written test program, and checking the behavior of the processor. This approach can be adopted for in-field test, e.g., at the power-on, power-off, or during the time slots unused by the system application. The method has been validated resorting to both an architectural and a memory fault simulator

    On the use of embedded debug features for permanent and transient fault resilience in microprocessors

    Get PDF
    Microprocessor-based systems are employed in an increasing number of applications where dependability is a major constraint. For this reason detecting faults arising during normal operation while introducing the least possible penalties is a main concern. Different forms of redundancy have been employed to ensure error-free behavior, while error detection mechanisms can be employed where some detection latency is tolerated. However, the high complexity and the low observability of microprocessors internal resources make the identification of adequate on-line error detection strategies a very challenging task, which can be tackled at circuit or system level. Concerning system-level strategies, a common limitation is in the mechanism used to monitor program execution and then detect errors as soon as possible, so as to reduce their impact on the application. In this work, an on-line error detection approach based on the reuse of available debugging infrastructures is proposed. The approach can be applied to different system architectures profiting from the debug trace port available in most of current microprocessors to observe possible misbehaviors. Two microprocessors have been used to study the applicability of the solution. LEON3 and ARM7TDMI. Results show that the presented fault detection technique enhances observability and thus error detection abilities in microprocessor-based systems without requiring modifications on the core architecture

    Fault Tolerant Electronic System Design

    Get PDF
    Due to technology scaling, which means reduced transistor size, higher density, lower voltage and more aggressive clock frequency, VLSI devices may become more sensitive against soft errors. Especially for those devices used in safety- and mission-critical applications, dependability and reliability are becoming increasingly important constraints during the development of system on/around them. Other phenomena (e.g., aging and wear-out effects) also have negative impacts on reliability of modern circuits. Recent researches show that even at sea level, radiation particles can still induce soft errors in electronic systems. On one hand, processor-based system are commonly used in a wide variety of applications, including safety-critical and high availability missions, e.g., in the automotive, biomedical and aerospace domains. In these fields, an error may produce catastrophic consequences. Thus, dependability is a primary target that must be achieved taking into account tight constraints in terms of cost, performance, power and time to market. With standards and regulations (e.g., ISO-26262, DO-254, IEC-61508) clearly specify the targets to be achieved and the methods to prove their achievement, techniques working at system level are particularly attracting. On the other hand, Field Programmable Gate Array (FPGA) devices are becoming more and more attractive, also in safety- and mission-critical applications due to the high performance, low power consumption and the flexibility for reconfiguration they provide. Two types of FPGAs are commonly used, based on their configuration memory cell technology, i.e., SRAM-based and Flash-based FPGA. For SRAM-based FPGAs, the SRAM cells of the configuration memory highly susceptible to radiation induced effects which can leads to system failure; and for Flash-based FPGAs, even though their non-volatile configuration memory cells are almost immune to Single Event Upsets induced by energetic particles, the floating gate switches and the logic cells in the configuration tiles can still suffer from Single Event Effects when hit by an highly charged particle. So analysis and mitigation techniques for Single Event Effects on FPGAs are becoming increasingly important in the design flow especially when reliability is one of the main requirements

    Methodologies for Accelerated Analysis of the Reliability and the Energy Efficiency Levels of Modern Microprocessor Architectures

    Get PDF
    Η εξέλιξη της τεχνολογίας ημιαγωγών, της αρχιτεκτονικής υπολογιστών και της σχεδίασης οδηγεί σε αύξηση της απόδοσης των σύγχρονων μικροεπεξεργαστών, η οποία επίσης συνοδεύεται από αύξηση της ευπάθειας των προϊόντων. Οι σχεδιαστές εφαρμόζουν διάφορες τεχνικές κατά τη διάρκεια της ζωής των ολοκληρωμένων κυκλωμάτων με σκοπό να διασφαλίσουν τα υψηλά επίπεδα αξιοπιστίας των παραγόμενων προϊόντων και να τα προστατέψουν από διάφορες κατηγορίες σφαλμάτων διασφαλίζοντας την ορθή λειτουργία τους. Αυτή η διδακτορική διατριβή προτείνει καινούριες μεθόδους για να διασφαλίσει τα υψηλά επίπεδα αξιοπιστίας και ενεργειακής απόδοσης των σύγχρονων μικροεπεξεργαστών οι οποίες μπορούν να εφαρμοστούν κατά τη διάρκεια του πρώιμου σχεδιαστικού σταδίου, του σταδίου παραγωγής ή του σταδίου της κυκλοφορίας των ολοκληρωμένων κυκλωμάτων στην αγορά. Οι συνεισφορές αυτής της διατριβής μπορούν να ομαδοποιηθούν στις ακόλουθες δύο κατηγορίες σύμφωνα με το στάδιο της ζωής των μικροεπεξεργαστών στο οποίο εφαρμόζονται: • Πρώιμο σχεδιαστικό στάδιο: Η στατιστική εισαγωγή σφαλμάτων σε δομές που είναι μοντελοποιημένες σε προσομοιωτές οι οποίοι στοχεύουν στην μελέτη της απόδοσης είναι μια επιστημονικά καθιερωμένη μέθοδος για την ακριβή μέτρηση της αξιοπιστίας, αλλά υστερεί στον αργό χρόνο εκτέλεσης. Σε αυτή τη διατριβή, αρχικά παρουσιάζουμε ένα νέο πλήρως αυτοματοποιημένο εργαλείο εισαγωγής σφαλμάτων σε μικροαρχιτεκτονικό επίπεδο που στοχεύει στην ακριβή αξιολόγηση της αξιοπιστίας ενός μεγάλου πλήθους μονάδων υλικού σε σχέση με διάφορα μοντέλα σφαλμάτων (παροδικά, διακοπτόμενα, μόνιμα σφάλματα). Στη συνέχεια, χρησιμοποιώντας το ίδιο εργαλείο και στοχεύοντας τα παροδικά σφάλματα, παρουσιάζουμε διάφορες μελέτες σχετιζόμενες με την αξιοπιστία και την απόδοση, οι οποίες μπορούν να βοηθήσουν τις σχεδιαστικές αποφάσεις στα πρώιμα στάδια της ζωής των επεξεργαστών. Τελικά, προτείνουμε δύο μεθοδολογίες για να επιταχύνουμε τα μαζικά πειράματα στατιστικής εισαγωγής σφαλμάτων. Στην πρώτη, επιταχύνουμε τα πειράματα έπειτα από την πραγματική εισαγωγή των σφαλμάτων στις δομές του υλικού. Στη δεύτερη, επιταχύνουμε ακόμη περισσότερο τα πειράματα προτείνοντας τη μεθοδολογία με όνομα MeRLiN, η οποία βασίζεται στη μείωση της αρχικής λίστας σφαλμάτων μέσω της ομαδοποίησής τους σε ισοδύναμες ομάδες έπειτα από κατηγοριοποίηση σύμφωνα με την εντολή που τελικά προσπελαύνει τη δομή που φέρει το σφάλμα. • Παραγωγικό στάδιο και στάδιο κυκλοφορίας στην αγορά: Οι συνεισφορές αυτής της διδακτορικής διατριβής σε αυτά τα στάδια της ζωής των μικροεπεξεργαστών καλύπτουν δύο σημαντικά επιστημονικά πεδία. Αρχικά, χρησιμοποιώντας το ολοκληρωμένο κύκλωμα των 48 πυρήνων με ονομασία Intel SCC, προτείνουμε μια τεχνική επιτάχυνσης του εντοπισμού μονίμων σφαλμάτων που εφαρμόζεται κατά τη διάρκεια λειτουργίας αρχιτεκτονικών με πολλούς πυρήνες, η οποία εκμεταλλεύεται το δίκτυο υψηλής ταχύτητας μεταφοράς μηνυμάτων που διατίθεται στα ολοκληρωμένα κυκλώματα αυτού του είδους. Δεύτερον, προτείνουμε μια λεπτομερή στατιστική μεθοδολογία με σκοπό την ακριβή πρόβλεψη σε επίπεδο συστήματος των ασφαλών ορίων λειτουργίας της τάσης των πυρήνων τύπου ARMv8 που βρίσκονται πάνω στη CPU X-Gene 2.The evolution in semiconductor manufacturing technology, computer architecture and design leads to increase in performance of modern microprocessors, which is also accompanied by increase in products’ vulnerability to errors. Designers apply different techniques throughout microprocessors life-time in order to ensure the high reliability requirements of the delivered products that are defined as their ability to avoid service failures that are more frequent and more severe than is acceptable. This thesis proposes novel methods to guarantee the high reliability and energy efficiency requirements of modern microprocessors that can be applied during the early design phase, the manufacturing phase or after the chips release to the market. The contributions of this thesis can be grouped in the two following categories according to the phase of the CPUs lifecycle that are applied at: • Early design phase: Statistical fault injection using microarchitectural structures modeled in performance simulators is a state-of-the-art method to accurately measure the reliability, but suffers from low simulation throughput. In this thesis, we firstly present a novel fully-automated versatile microarchitecture-level fault injection framework (called MaFIN) for accurate characterization of a wide range of hardware components of an x86-64 microarchitecture with respect to various fault models (transient, intermittent, permanent faults). Next, using the same tool and focusing on transient faults, we present several reliability and performance related studies that can assist design decision in the early design phases. Moreover, we propose two methodologies to accelerate the statistical fault injection campaigns. In the first one, we accelerate the fault injection campaigns after the actual injection of the faults in the simulated hardware structures. In the second, we further accelerate the microarchitecture level fault injection campaigns by proposing MeRLiN a fault pre-processing methodology that is based on the pruning of the initial fault list by grouping the faults in equivalent classes according to the instruction access patterns to hardware entries. • Manufacturing phase and release to the market: The contributions of this thesis in these phases of microprocessors life-cycle cover two important aspects. Firstly, using the 48-core Intel’s SCC architecture, we propose a technique to accelerate online error detection of permanent faults for many-core architectures by exploiting their high-speed message passing on-chip network. Secondly, we propose a comprehensive statistical analysis methodology to accurately predict at the system level the safe voltage operation margins of the ARMv8 cores of the X- Gene 2 chip when it operates in scaled voltage conditions
    corecore