
    Flexible Spare Core Placement in Torus Topology based NoCs and its validation on an FPGA

    In the nano-scale era, the Network-on-Chip (NoC) interconnection paradigm has gained importance as a way to meet the communication challenges of Chip Multi-Processors (CMPs). With increased integration density on CMPs, NoC components, namely cores, routers, and links, are susceptible to failures. Therefore, to improve system reliability, efficient fault-tolerant techniques are needed to mitigate permanent faults in NoC-based CMPs. Several fault-tolerant techniques address permanent faults in application cores by placing spare cores onto NoC topologies; however, these techniques are limited to Mesh topology-based NoCs. A few approaches have realized fault-tolerant solutions on an FPGA, but the study of the architectural aspects of the NoC is limited. This paper presents the flexible placement of a spare core onto a Torus topology-based NoC design, considering core faults, and validates it on an FPGA. In the first phase, a mathematical formulation based on Integer Linear Programming (ILP) and a meta-heuristic based on Particle Swarm Optimization (PSO) are proposed for the placement of the spare core. In the second phase, we implement the NoC router addressing scheme, routing algorithm, run-time fault injection model, and fault-tolerant placement of the spare core onto the Torus topology on an FPGA. Experiments have been carried out on different multimedia and synthetic application benchmarks, in both static and dynamic simulation environments, followed by hardware implementation. In the static simulation environment, the experiments scale the network size and the number of router faults in the network. The results obtained from our approach outperform methods proposed in the literature such as Fault-tolerant Spare Core Mapping (FSCM), Simulated Annealing (SA), and Genetic Algorithm (GA). When scaling the network size, our methodology shows an average improvement in communication cost of 18.83%, 4.55%, and 12.12% over FSCM, SA, and GA, respectively. When scaling the router faults in the network, our approach shows improvements of 34.27%, 26.26%, and 30.41% over FSCM, SA, and GA, respectively. For the dynamic simulations, our approach shows average improvements of 5.67%, 0.44%, and 3.69% over FSCM, SA, and GA, respectively. In the hardware implementation, our approach shows average improvements in application runtime of 5.38%, 7.45%, and 27.10% over SA, GA, and FSCM, respectively. This demonstrates the superiority of the proposed approach over the approaches in the literature.
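    The communication-cost objective that the ILP and PSO formulations optimize is not spelled out in the abstract; a common formulation sums traffic volume times hop distance over all communicating core pairs, with the hop distance on a torus accounting for wrap-around links. The Python sketch below illustrates that idea; the cost model, function names, network size, and traffic values are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch only: a communication-cost objective of the kind an ILP or PSO
# spare-core placement would minimize. The cost model, names, and numbers below are
# assumptions, not the paper's exact formulation.

def torus_hops(tile_a, tile_b, k):
    """Manhattan hop distance on a k x k torus, accounting for wrap-around links."""
    (ax, ay), (bx, by) = tile_a, tile_b
    dx, dy = abs(ax - bx), abs(ay - by)
    return min(dx, k - dx) + min(dy, k - dy)

def communication_cost(mapping, traffic, k):
    """Sum of (traffic volume x hop distance) over all communicating core pairs.

    mapping: dict core -> (x, y) tile on the k x k torus
    traffic: dict (src_core, dst_core) -> data volume from the application task graph
    """
    return sum(vol * torus_hops(mapping[src], mapping[dst], k)
               for (src, dst), vol in traffic.items())

if __name__ == "__main__":
    k = 4                                                   # hypothetical 4x4 torus
    mapping = {"c0": (0, 0), "c1": (3, 0), "spare": (1, 1)}
    traffic = {("c0", "c1"): 120, ("c1", "c0"): 80}
    print(communication_cost(mapping, traffic, k))          # cost of the original placement
    mapping["c1"] = mapping["spare"]                        # c1 assumed faulty: its tasks move to the spare tile
    print(communication_cost(mapping, traffic, k))          # cost after the spare takes over
```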

    Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

    This research addresses the design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied, and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 × 10⁻⁶, three orders of magnitude better than non-fault-tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10⁻⁶. The required hardware redundancy is approximately 15 times that of a non-fault-tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of the model development, an improved model is derived for NAND multiplexing. It is the first accurate model for small and medium amounts of redundancy; previous models are extended to account for dependence between the inputs and produce more accurate results.
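    As rough intuition for how a device failure probability of around 3 × 10⁻⁶ translates into yield, the sketch below uses a simple independent-failure model: a non-redundant block of a million devices almost never yields, and even coarse block-level sparing falls far short of the 90% reported with fine-grained fault tolerance. The model, device counts, and sparing scheme are illustrative assumptions, not the paper's yield or cost models.

```python
# Minimal illustrative yield calculation (not the paper's yield or cost model):
# independent device failures with probability p_fail; a block of n devices works
# only if every device works.

def block_yield(p_fail, n_devices):
    """Probability that a block of n_devices works, assuming independent failures."""
    return (1.0 - p_fail) ** n_devices

def spared_yield(p_fail, n_devices, copies):
    """Yield when the whole block is replicated and at least one copy must work.
    This coarse sparing scheme is only for illustration; FDT designs rely on
    finer-grained redundancy (e.g., per-row sparing, NAND multiplexing)."""
    return 1.0 - (1.0 - block_yield(p_fail, n_devices)) ** copies

if __name__ == "__main__":
    p = 3e-6          # device failure probability quoted in the abstract
    n = 1_000_000     # hypothetical device count for a cache-sized block
    print(f"no redundancy : {block_yield(p, n):.3f}")      # ~0.05
    print(f"4 spare copies: {spared_yield(p, n, 4):.3f}")  # ~0.19, still far from 90%
```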

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect- and fault-free system would heavily, even prohibitively, impact the design, manufacturing, and testing costs, as well as system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free; this fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured against the IEC 61508 functional safety standard) and tight power and performance constraints.

    Decompose and Conquer: Addressing Evasive Errors in Systems on Chip

    Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things devices to smartphones, personal computers, and data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation. Thus, they are often found in functional features that are rarely activated. Complete functional verification, which can eliminate design bugs, is extremely time-consuming, thus impractical in modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes. Indeed, weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, thus they are infeasible for most cost-sensitive SoC designs. To tackle and overcome these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, by presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify whether a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with small distributed programmable logic instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so as to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, there are a variety of interactions among them that must be verified to catch buggy interactions. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests. Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally overlooked rare patterns of multiple faults.
In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.
    Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd
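    The memory consistency testing framework mentioned above detects buggy memory-ordering behaviors; a standard way such checks are framed is to build a happens-before graph over the memory operations of an execution and flag a violation when that graph contains a cycle (the dissertation's contribution is to decompose this graph into small components based on incremental differences, which the sketch below does not attempt). The graph encoding, names, and test case are illustrative assumptions, not the dissertation's implementation.

```python
# Toy illustration, not the dissertation's framework: memory-ordering violations are
# commonly detected as cycles in a happens-before graph built over memory operations.
from collections import defaultdict

def has_ordering_cycle(edges):
    """Return True if the directed ordering graph contains a cycle, i.e. the observed
    execution cannot be explained by any total order consistent with the edges."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    WHITE, GRAY, BLACK = 0, 1, 2           # unvisited / on the DFS stack / finished
    color = {n: WHITE for n in nodes}

    def dfs(n):
        color[n] = GRAY
        for m in graph[n]:
            if color[m] == GRAY:           # back edge -> cycle
                return True
            if color[m] == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in nodes)

if __name__ == "__main__":
    # Hypothetical store-buffering test: program-order (po) and from-read edges when
    # both loads observe the old values.
    edges = [("st_x@A", "ld_y@A"),   # po on core A
             ("st_y@B", "ld_x@B"),   # po on core B
             ("ld_y@A", "st_y@B"),   # A's load read the old y, so it precedes B's store
             ("ld_x@B", "st_x@A")]   # B's load read the old x, so it precedes A's store
    print(has_ordering_cycle(edges))  # True: this outcome is illegal under sequential consistency
```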