
    PyGFI: Analyzing and Enhancing Robustness of Graph Neural Networks Against Hardware Errors

    Graph neural networks (GNNs) have recently emerged as a promising paradigm for learning graph-structured data and have demonstrated wide success across domains such as recommendation systems, social networks, and electronic design automation (EDA). Like other deep learning (DL) methods, GNNs are being deployed on sophisticated modern hardware systems as well as dedicated accelerators. However, despite the popularity of GNNs and the recent efforts to bring them to hardware, the fault tolerance and resilience of GNNs have generally been overlooked. Inspired by the inherent algorithmic resilience of DL methods, this paper conducts, for the first time, a large-scale empirical study of GNN resilience, aiming to understand the relationship between hardware faults and GNN accuracy. By developing a customized fault injection tool on top of PyTorch, we perform extensive fault injection experiments on various GNN models and application datasets. We observe that the error resilience of GNN models varies by orders of magnitude across models and application datasets. Further, we explore a low-cost error mitigation mechanism for GNNs to enhance their resilience. This GNN resilience study aims to open up new directions and opportunities for future GNN accelerator design and architectural optimization.
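    The abstract mentions a customized PyTorch-based fault injector but does not show it. The sketch below illustrates one common way weight-level bit-flip injection is implemented; the names and the single-bit fault model are assumptions for illustration, not PyGFI's actual API.

        import torch

        def inject_bit_flip(weight: torch.Tensor, bit: int = 30) -> None:
            # Reinterpret the float32 storage as int32 and XOR one bit of a
            # randomly chosen element, emulating a single-bit hardware fault.
            flat = weight.data.view(-1)
            raw = flat.view(torch.int32)
            idx = torch.randint(0, flat.numel(), (1,)).item()
            raw[idx] ^= 1 << bit

        def run_fault_trial(model: torch.nn.Module, inputs, n_flips: int = 1):
            # Corrupt a few randomly chosen parameters, then run inference;
            # the caller is expected to reload clean weights between trials.
            params = [p for p in model.parameters() if p.dtype == torch.float32]
            for _ in range(n_flips):
                target = params[torch.randint(0, len(params), (1,)).item()]
                inject_bit_flip(target)
            with torch.no_grad():
                return model(inputs)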

    Power and area efficient clock stretching and critical path reshaping for error resilience

    Process, voltage, and temperature variations are on the rise with technology scaling. Nanoscale technologies require large design margins to ensure reliable operation, and worst-case design margining consumes a significant share of circuit and system resources. In-situ error detection or correction is an alternative, cost-effective method for variation tolerance. However, existing in-situ error detection and correction circuits are power- and area-hungry, since they rely on speculative error management, which yields smaller power savings at higher error rates. This paper proposes an error resilience technique that utilizes the slack already available in the design. The proposed method uses a clock stretching circuit to relax timing margins on selected critical paths that have sufficient consecutive-stage slack. We also propose a power optimization method that reshapes the critical path logic in proportion to the consecutive-stage slack. Experimental results show that the proposed method achieves power and area savings of 40% and 8%, respectively, compared to the worst-case design approach. Compared to the TIMBER error resilience approach, the proposed method saves more than 74% power and more than 13% area at design time.
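    As a rough arithmetic picture of the consecutive-stage-slack condition (the two-stage timing model and the numbers below are assumptions for illustration, not taken from the paper):

        def stretch_ok(stage_delay_ps: float, next_stage_delay_ps: float,
                       clock_period_ps: float) -> bool:
            # Clock stretching lets a slow stage borrow time from its
            # successor: the pair must still fit in two clock periods
            # (toy timing model).
            return stage_delay_ps + next_stage_delay_ps <= 2 * clock_period_ps

        # A 1100 ps critical stage followed by a 700 ps stage meets a 1000 ps
        # clock once the edge is stretched: the successor's slack absorbs it.
        print(stretch_ok(1100, 700, 1000))   # True
        print(stretch_ok(1100, 1000, 1000))  # False: no consecutive-stage slack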

    An Inexact Ultra-low Power Bio-signal Processing Architecture With Lightweight Error Recovery

    The energy efficiency of digital architectures is tightly linked to the voltage level (Vdd) at which they operate. Aggressive voltage scaling is therefore mandatory when ultra-low power processing is required. Nonetheless, the lowest admissible Vdd is often bounded by reliability concerns, especially since static and dynamic non-idealities are exacerbated in the near-threshold region, imposing costly guard-bands to guarantee correctness under worst-case conditions. A striking alternative, explored in this paper, waives the requirement for unconditional correctness and operates under more relaxed constraints. First, after a run-time failure, processing resumes correctly at a later point in time. Second, failures induce only a limited Quality-of-Service (QoS) degradation. We focus our investigation on the practical scenario of embedded bio-signal analysis, a domain in which energy efficiency is key, while applications are inherently error-tolerant to a certain degree. Targeting a domain-specific multi-core platform, we present a study of the impact of inexactness on application-visible errors. Then, we introduce a novel methodology to manage them, which requires minimal hardware resources and a negligible energy overhead. Experimental evidence shows that, by tolerating 900 errors/hour, the resulting inexact platform can achieve an efficiency increase of up to 24%, with a QoS degradation of less than 3%.
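    Purely as a conceptual sketch of lightweight recovery with bounded QoS loss (the window-based scheme and names below are assumptions, not the paper's hardware mechanism): a corrupted analysis window is dropped and the previous result reused, so processing resumes cleanly while only local quality is lost.

        def process_stream(windows, analyze, error_flagged):
            # Process fixed-size signal windows; when a fault is flagged,
            # reuse the previous window's result instead of propagating the
            # corrupted computation, degrading QoS gracefully.
            results, last = [], None
            for w in windows:
                if error_flagged(w):
                    results.append(last)
                else:
                    last = analyze(w)
                    results.append(last)
            return results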

    Quantitative Performance Evaluation of Uncertainty-Aware Hybrid AADL Designs Using Statistical Model Checking

    Architecture Analysis and Design Language (AADL) is widely used for the architecture design and analysis of safety-critical real-time systems. Based on the Hybrid annex, which supports continuous behavior modeling, Hybrid AADL enables seamless interactions between embedded control systems and continuous physical environments. Although Hybrid AADL is promising for dependability prediction through analyzable architecture development, worst-case performance analysis of Hybrid AADL designs can easily lead to an overly pessimistic estimation. So far, Hybrid AADL cannot be used to accurately quantify and reason about the overall performance of complex systems that interact intensively with uncertain external environments. To address this problem, this paper proposes a statistical model checking based framework that can quantitatively evaluate uncertainty-aware Hybrid AADL designs against various performance queries. Our approach extends Hybrid AADL to support the modeling of environment uncertainties. Furthermore, we propose a set of transformation rules that can automatically translate AADL designs together with designers' requirements into Networks of Priced Timed Automata (NPTA) and performance queries, respectively. Comprehensive experimental results on the Movement Authority (MA) scenario of Chinese Train Control System Level 3 (CTCS-3) demonstrate the effectiveness of our approach.
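    The essence of statistical model checking is estimating the probability that a property holds from sampled trajectories rather than from exhaustive state exploration. The sketch below illustrates that idea; the braking model and all numbers are invented for illustration and are not drawn from the CTCS-3 case study.

        import math
        import random

        def smc_estimate(simulate, holds, runs=10000):
            # Sample random trajectories of the uncertain system and estimate
            # the probability that the property holds, with a 95% confidence
            # interval from the normal approximation.
            hits = sum(holds(simulate()) for _ in range(runs))
            p = hits / runs
            half = 1.96 * math.sqrt(p * (1 - p) / runs)
            return p, (p - half, p + half)

        def stopping_distance():
            # Toy uncertain environment: a train brakes with noisy deceleration.
            decel = max(random.gauss(1.0, 0.15), 0.1)  # m/s^2
            v = 80 / 3.6                               # 80 km/h in m/s
            return v * v / (2 * decel)                 # metres

        # "What is the probability the train stops within a 300 m authority?"
        p, ci = smc_estimate(stopping_distance, lambda d: d <= 300)
        print(p, ci)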

    An Efficient NoC-based Framework To Improve Dataflow Thread Management At Runtime

    This doctoral thesis focuses on how application threads based on the dataflow execution model can be managed at the Network-on-Chip (NoC) level. The roots of the dataflow execution model date back to the early 1970s. Applications adhering to this execution model follow a simple producer-consumer communication scheme for synchronising parallel thread activities; in a dataflow execution environment, a thread can run if and only if all of its required inputs are available. Applications running on large and complex computing environments can benefit significantly from adopting the dataflow model. The first part of the thesis focuses on the thread distribution mechanism and shows how a scalable hash-based thread distribution mechanism can be implemented at the router level with low overhead. To enhance the support further, a tool to monitor the status of dataflow threads and a simple functional model are also incorporated into the design. Next, a software-defined NoC is proposed to manage the distribution of dataflow threads by exploiting its reconfigurability. The second part of this work focuses on the NoC microarchitecture level. A traditional 2D-mesh topology is combined with a standard ring to study how such a hybrid network topology can outperform the 2D mesh alone. Finally, a mixed-integer linear programming based analytical model is proposed to verify whether the mapping of application threads onto free cores is optimal; this model can be used as a yardstick to assess the solution quality of the newly developed mapping policy. Providing a complete low-level framework for dataflow thread execution with better resource and power management is not trivial; however, this work can be considered a foundational framework on which such improvements can be built.
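    As a toy picture of the hash-based distribution idea (the function below is an assumption for illustration, not the thesis's router logic): a stable hash of the thread identifier selects the destination core, so every router computes the same mapping with no central coordination.

        from hashlib import blake2b

        def home_core(thread_id: int, n_cores: int) -> int:
            # Hash the thread identifier and reduce modulo the core count;
            # any router can evaluate this locally and agree on the result.
            digest = blake2b(thread_id.to_bytes(8, "little"),
                             digest_size=4).digest()
            return int.from_bytes(digest, "little") % n_cores

        # Threads spread across 16 cores without a central allocator.
        print([home_core(t, 16) for t in range(8)])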

    Cross-Layer Approaches for an Aging-Aware Design of Nanoscale Microprocessors

    Thanks to aggressive scaling of transistor dimensions, computers have revolutionized our lives. However, the increasing unreliability of devices fabricated in nanoscale technologies has emerged as a major threat to the future success of computers. In particular, accelerated transistor aging is of great importance, as it reduces the lifetime of digital systems. This thesis addresses this challenge by proposing new methods to model, analyze, and mitigate aging at the microarchitecture level and above.

    New Logic Synthesis As Nanotechnology Enabler (invited paper)

    Nanoelectronics comprises a variety of devices whose electrical properties are more complex than those of CMOS, thus enabling new computational paradigms. The potentially large space for innovation has to be explored in the search for technologies that can support large-scale and high-performance circuit design. Within this space, we analyze a set of emerging technologies characterized by a similar computational abstraction at the design level, i.e., a binary comparator or a majority voter. We demonstrate that new logic synthesis techniques natively supporting this abstraction are the technology enablers. We describe models and data structures for logic design using emerging technologies and show results of applying new synthesis algorithms and tools. We conclude that new logic synthesis methods are required both to evaluate emerging technologies and to achieve the best results in terms of area, power, and performance.
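    To make the majority-voter abstraction concrete: a three-input majority function subsumes both AND and OR, which is one reason synthesis over majority primitives can cover conventional logic. The check below is a minimal illustration, not the paper's synthesis algorithm.

        def maj(a: bool, b: bool, c: bool) -> bool:
            # Three-input majority: true when at least two inputs are true.
            return (a and b) or (a and c) or (b and c)

        # Fixing one input to a constant recovers the classical gates:
        #   MAJ(a, b, 0) == a AND b      MAJ(a, b, 1) == a OR b
        bits = (False, True)
        assert all(maj(a, b, False) == (a and b) for a in bits for b in bits)
        assert all(maj(a, b, True) == (a or b) for a in bits for b in bits)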

    Reliable Software for Unreliable Hardware - A Cross-Layer Approach

    This thesis proposes a novel cross-layer approach to reliability analysis, modeling, and optimization. It leverages multiple layers of the system design abstraction (i.e., hardware, compiler, system software, and application program) to exploit the reliability-enhancing potential available at each system layer and to exchange this information across layers.

    Temperature-Aware Design and Management for 3D Multi-Core Architectures

    Vertically integrated 3D multiprocessor systems-on-chip (3D MPSoCs) provide the means to continue integrating more functionality within a unit area while enhancing manufacturing yields and runtime performance. However, 3D MPSoCs face amplified thermal challenges that undermine their reliability. To address these issues, several advanced cooling technologies, alongside temperature-aware design-time optimizations and run-time management schemes, have been proposed. In this monograph, we provide an overall survey of recent advances in temperature-aware 3D MPSoC design. We explore the advanced cooling strategies, thermal modeling frameworks, design-time optimizations, and run-time thermal management schemes that primarily target 3D MPSoCs. The aim of this survey is to provide a global perspective, highlighting the advancements and drawbacks of the recent state of the art.