
    Cross-layer system reliability assessment framework for hardware faults

    System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored into such an estimation model, the delivered reliability reports tend to be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and a supporting suite of tools for accurate but fast estimation of computing system reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components, considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.
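The core idea of a component-based reliability model can be sketched as follows: each hardware component contributes to the system failure rate only the fraction of its raw faults that no layer masks. The component names, FIT rates, and masking probabilities below are illustrative assumptions, not values from the paper, and the paper's actual Bayesian treatment of component interactions is not reproduced here.

```python
# Hypothetical per-component raw FIT rates and end-to-end masking
# probabilities (circuit + microarchitecture + software combined).
components = {
    "register_file": {"raw_fit": 120.0, "masking": 0.85},
    "l1_dcache":     {"raw_fit": 300.0, "masking": 0.92},
    "alu":           {"raw_fit": 40.0,  "masking": 0.60},
}

def system_fit(components):
    # A fault contributes to the system FIT rate only if no layer masks it.
    return sum(c["raw_fit"] * (1.0 - c["masking"]) for c in components.values())

print(system_fit(components))  # 120*0.15 + 300*0.08 + 40*0.40 = 58.0
```

This additive form assumes components fail independently; the paper's Bayesian model is precisely what relaxes that assumption for interacting components.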

    Cross layer reliability estimation for digital systems

    Forthcoming manufacturing technologies hold the promise of increasing the performance and functionality of multifunctional computing systems thanks to a remarkable growth in device integration density. Despite the benefits introduced by these technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor sizes reaching atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and to environmental stress rises dramatically. Failing to meet a reliability requirement may add excessive re-design cost and may have severe consequences for the success of a product. Worst-case design with large margins to guarantee reliable operation has been employed for a long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building "dependable" systems on top of unreliable components, which will degrade and even fail during the normal lifetime of the chip. Conventional design techniques are highly inefficient: they expend a significant amount of energy to tolerate device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency, or charge stored per bit. Unfortunately, the additional costs introduced to compensate for unreliability are rapidly becoming unacceptable in today's environment, where power consumption is often the limiting factor for integrated circuit performance and energy efficiency is a top concern. Attention should be paid to tailoring reliability-improvement techniques to a system's requirements, ending up with cost-effective solutions that favor the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal.
Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., the technology, hardware, and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanisms are implemented at different layers, from the technology up to the software layer, optimizing the system by exploiting the inherent capability of each layer to mask lower-level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost, and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors, and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains.
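The cross-layer masking chain described above can be sketched numerically: a raw technology-level fault becomes a system failure only if every layer fails to mask it, so the effective failure rate is the raw rate multiplied by a derating factor per layer. The layer names are the ones the abstract lists; the masking probabilities and raw rate are invented for the sketch.

```python
# Assumed per-layer masking probabilities (illustrative, not measured).
layer_masking = {
    "technology":        0.30,  # e.g., electrical masking of the pulse
    "circuit":           0.40,  # latching-window masking
    "microarchitecture": 0.50,  # dead state, squashed instructions
    "software":          0.60,  # dead variables, logical masking
}

def end_to_end_derating(layer_masking):
    factor = 1.0
    for mask in layer_masking.values():
        factor *= (1.0 - mask)   # the fault must survive this layer too
    return factor

raw_fault_rate = 1000.0          # hypothetical raw fault rate (FIT)
effective = raw_fault_rate * end_to_end_derating(layer_masking)
print(effective)                 # 1000 * 0.7 * 0.6 * 0.5 * 0.4 = 84.0
```

The multiplicative structure is what makes worst-case single-layer analysis so pessimistic: ignoring the upper layers here would overestimate the failure rate by more than 10x.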

    GPGPU injector 4.0: A Framework for Architectural Vulnerability Factor (AVF) Assessments Across Nvidia GPUs Generations using GPGPU-Sim 4.0 simulator

    A GPU (Graphics Processing Unit) is a programmable processor on which thousands of processing cores run simultaneously in massive parallelism, where each core is focused on making efficient calculations, facilitating real-time processing and analysis of enormous datasets.
Due to the development of general-purpose parallel programming environments and languages, all modern GPUs are general-purpose GPUs (GPGPUs), as they can be programmed for non-graphics applications and can direct their processing power towards massively parallel problems. Therefore, as in all general-purpose computing platforms, the reliability of GPU hardware structures is a very important factor that architects need to estimate accurately early in the design cycle to weigh the benefits of error protection techniques against their costs. In this thesis, we introduce GPGPU injector 4.0, a fault injection framework for Architectural Vulnerability Factor (AVF) assessment of hardware structures and entire GPU chips that runs on top of the state-of-the-art performance simulator for Nvidia GPU architectures: GPGPU-sim. We use GPGPU injector 4.0 to inject transient faults (soft errors) on CUDA-enabled GPU architectures. The target hardware structures include the register file, the shared memory, the L1 data/texture cache, and the L2 cache, which altogether account for several tens of MBs of on-chip GPU storage. More specifically, we compute the AVF of two widely used recent graphics cards, the RTX 2060 and the Quadro GV100, by experimenting with ten different CUDA benchmarks that are simulated at the actual instruction set level (SASS).
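Statistical fault injection of the kind described above can be sketched in a few lines: inject a random single-bit flip per run, classify the outcome, and report the AVF as the fraction of injections that corrupted the output, with a confidence interval that shrinks with the number of runs. The `run_with_fault` stub below stands in for a full GPGPU-sim run, and its 80% masking rate is an arbitrary assumption, not a measured GPU result.

```python
import random

def run_with_fault(bit, rng):
    # Placeholder for one simulator run with a bit flip injected; here
    # faults are assumed benign 80% of the time (illustrative only).
    return "masked" if rng.random() < 0.8 else "corrupted"

def estimate_avf(structure_bits, n_injections, seed=0):
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_injections):
        bit = rng.randrange(structure_bits)   # uniform over the structure
        if run_with_fault(bit, rng) != "masked":
            failures += 1
    avf = failures / n_injections
    # 95% confidence half-width for a proportion (normal approximation).
    margin = 1.96 * (avf * (1 - avf) / n_injections) ** 0.5
    return avf, margin

avf, margin = estimate_avf(structure_bits=1 << 20, n_injections=10000)
print(f"AVF = {avf:.3f} +/- {margin:.3f}")
```

The key practical point is that the required number of injections depends only on the desired confidence interval, not on the size of the structure, which is what makes sampling-based AVF campaigns tractable for multi-MB GPU storage arrays.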

    Analyzing and Predicting Processor Vulnerability to Soft Errors Using Statistical Techniques

    The shrinking processor feature size, lower threshold voltage, and increasing on-chip transistor density make current processors highly vulnerable to soft errors. The Architectural Vulnerability Factor (AVF) reflects the probability that a raw soft error eventually causes a visible error in the program output, indicating the processor's susceptibility to soft errors at the architectural level. Awareness of the AVF, both at the early design stage and during program runtime, is highly useful for designing reliable processors. However, measuring the AVF is extremely costly, resulting in large overheads in hardware, computation, and power. The situation is further exacerbated in a multi-threaded processor environment, where resource contention and data sharing exist among different threads. Consequently, predicting the AVF from other easily measured metrics becomes extraordinarily attractive to computer designers. We propose a series of AVF modeling and prediction works using advanced statistical techniques. First, we utilize the Boosted Regression Trees (BRT) scheme to dynamically predict the AVF during program execution from a variety of performance metrics. This correlation is generalized across different workloads, program phases, and processor configurations on a single-threaded superscalar processor. Second, the AVF prediction is extended to multi-threaded processors, where inter-thread resource contention shows significant and non-uniform impacts on different programs; we propose a two-level predictive mechanism using BRT as building blocks to characterize the contention behavior. Finally, we employ a rule search strategy named the Patient Rule Induction Method (PRIM) to explore a large processor design space at the early design stage. We are capable of generating selective rules on important configuration parameters. These rules quantify the design-space subregion yielding the lowest values of the response, thereby providing useful guidelines for designing reliable processors while achieving high performance.
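The BRT idea of predicting AVF from easily measured performance metrics can be illustrated with a minimal gradient-boosting sketch over regression stumps: each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble. The feature names (IPC, cache miss rate) and the synthetic data are assumptions for the sketch; the work itself uses a full BRT package on real measurements.

```python
def fit_stump(X, y):
    """Best single-feature threshold split minimizing squared error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((yi - lm) ** 2 for yi in left) +
                   sum((yi - rm) ** 2 for yi in right))
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm

def boost(X, y, rounds=300, lr=0.1):
    """Gradient boosting on squared loss: fit stumps to residuals."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        s = fit_stump(X, residual)
        stumps.append(s)
        pred = [pi + lr * s(row) for pi, row in zip(pred, X)]
    return lambda row: sum(lr * s(row) for s in stumps)

# Toy data: [IPC, cache miss rate] -> AVF (synthetic, for illustration).
X = [[1.2, 0.02], [0.8, 0.10], [1.5, 0.01], [0.6, 0.15], [1.0, 0.05]]
y = [0.10, 0.30, 0.08, 0.35, 0.18]
model = boost(X, y)
```

The shrinkage factor `lr` is what distinguishes boosting from repeatedly fitting a single tree: each stump corrects only a fraction of the remaining error, which in the full BRT scheme is what gives the model its robustness across workloads and configurations.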