2,635 research outputs found
DeSyRe: on-Demand System Reliability
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-free and fault-free system would heavily, even prohibitively, impact design, manufacturing, and testing costs, as well as system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
Reliability-aware and energy-efficient system level design for networks-on-chip
2015 Spring. Includes bibliographical references. With CMOS technology aggressively scaling into the ultra-deep sub-micron (UDSM) regime and application complexity growing rapidly in recent years, processors today are being driven to integrate multiple cores on a chip. Such chip multiprocessor (CMP) architectures offer unprecedented levels of computing performance for highly parallel emerging applications in the era of digital convergence. However, a major challenge facing the designers of these emerging multicore architectures is the increased likelihood of failure due to the rise in transient, permanent, and intermittent faults caused by a variety of factors that are becoming more and more prevalent with technology scaling. On-chip interconnect architectures are particularly susceptible to faults that can corrupt transmitted data or prevent it from reaching its destination. Reliability concerns in UDSM nodes have in part contributed to the shift from traditional bus-based communication fabrics to network-on-chip (NoC) architectures that provide better scalability, performance, and utilization than buses. In this thesis, to overcome potential faults in NoCs, my research began by exploring fault-tolerant routing algorithms. Under the constraint of deadlock freedom, we make use of the inherent redundancy in NoCs due to multiple paths between packet sources and sinks and propose different fault-tolerant routing schemes to achieve much better fault tolerance capabilities than possible with traditional routing schemes. The proposed schemes also use replication opportunistically to optimize the balance between energy overhead and arrival rate. As 3D integrated circuit (3D-IC) technology with wafer-to-wafer bonding has been recently proposed as a promising candidate for future CMPs, we also propose a fault-tolerant routing scheme for 3D NoCs which outperforms the existing popular routing schemes in terms of energy consumption, performance and reliability.
To quantify reliability and provide different levels of intelligent protection, for the first time, we propose the network vulnerability factor (NVF) metric to characterize the vulnerability of NoC components to faults. NVF determines the probabilities that faults in NoC components manifest as errors in the final program output of the CMP system. With NVF-aware partial protection for NoC components, almost 50% of the energy cost can be saved compared to the traditional approach of comprehensively protecting all NoC components. Lastly, we focus on the problem of fault-tolerant NoC design, which involves many NP-hard sub-problems such as core mapping, fault-tolerant routing, and fault-tolerant router configuration. We propose a novel design-time (RESYN) and a hybrid design and runtime (HEFT) synthesis framework to trade off energy consumption and reliability in the NoC fabric at the system level for CMPs. Together, our research in fault-tolerant NoC routing, reliability modeling, and reliability-aware NoC synthesis substantially enhances NoC reliability and energy-efficiency beyond what is possible with traditional approaches and state-of-the-art strategies from prior work.
Reinforcement Learning based Fault-Tolerant Routing Algorithm for Mesh based NoC and its FPGA Implementation
Network-on-Chip (NoC) has emerged as the most promising on-chip interconnection framework in Multi-Processor System-on-Chips (MPSoCs) due to its efficiency and scalability. At the deep submicron level, NoCs are vulnerable to faults, which lead to the failure of network components such as links and routers. Failures in NoC components diminish system efficiency and reliability. This paper proposes a Reinforcement Learning based Fault-Tolerant Routing (RL-FTR) algorithm to tackle the routing issues caused by link and router faults in the mesh-based NoC architecture. The efficiency of the proposed RL-FTR algorithm is examined using a SystemC-based cycle-accurate NoC simulator. Simulations are carried out by increasing the number of link and router faults in various sizes of mesh. Following the simulations, the real-time functioning of the proposed RL-FTR algorithm is observed using an FPGA implementation. Results of the simulation and hardware show that the proposed RL-FTR algorithm provides an optimal routing path from the source router to the destination router.
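The abstract does not describe the internals of RL-FTR, so the following is only a minimal illustrative sketch of how tabular Q-values can steer packets around a faulty link in a mesh. Everything in it is an assumption rather than the paper's method: the function names (`neighbors`, `learn_q`, `route`), the reward of -1 per hop, and the use of deterministic sweeps over all router/link pairs (with learning rate 1 and no discounting) instead of sampled episodes, chosen so that the result is reproducible.

```python
from itertools import product

def neighbors(node, width, height, faulty_links):
    """Yield mesh neighbours of `node` that are reachable over healthy links."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < width and 0 <= nxt[1] < height:
            if frozenset((node, nxt)) not in faulty_links:
                yield nxt

def learn_q(width, height, dst, faulty_links, sweeps=None):
    """Hypothetical Q-table: q[router][next_router] = estimated return to dst."""
    nodes = list(product(range(width), range(height)))
    sweeps = sweeps or len(nodes)  # enough sweeps for values to settle
    q = {n: {m: 0.0 for m in neighbors(n, width, height, faulty_links)}
         for n in nodes}
    for _ in range(sweeps):
        for n in nodes:
            if n == dst:
                continue
            for m in q[n]:
                # Q-learning target with reward -1 per hop, learning rate 1.
                future = 0.0 if m == dst else max(q[m].values())
                q[n][m] = -1.0 + future
    return q

def route(q, src, dst, max_hops):
    """Follow the greedy (highest-Q) output port from src until dst."""
    path = [src]
    while path[-1] != dst:
        options = q[path[-1]]
        if not options or len(path) > max_hops:
            return None  # isolated router or unreachable destination
        path.append(max(options, key=options.get))
    return path

# Example: 4x4 mesh, one faulty link on the direct row from (0,0) to (3,0).
faulty = {frozenset(((1, 0), (2, 0)))}
q_table = learn_q(4, 4, dst=(3, 0), faulty_links=faulty)
path = route(q_table, src=(0, 0), dst=(3, 0), max_hops=16)
```

In this example the fault-free route would take 3 hops; with the faulty link the greedy route detours through the neighbouring row and takes 5 hops.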
Fault-Tolerant Application-Specific Topology based NoC and its Prototype on an FPGA
Application-Specific Networks-on-Chips (ASNoCs) are suitable communication platforms for meeting current application requirements. Interconnection links are the primary components involved in communication between the cores of an ASNoC design. The integration density in ASNoC increases with continuous scaling down of the transistor size. Excessive integration density in ASNoC can result in the formation of thermal hotspots, which can cause a system to fail permanently. As a result, fault-tolerant techniques are required to address the permanent faults in interconnection links of an ASNoC design. By taking into account link faults in the topology, this paper introduces a fault-tolerant application-specific topology-based NoC design and its prototype on an FPGA. To place spare links in the ASNoC topology, a meta-heuristic algorithm based on Particle Swarm Optimization (PSO) is proposed. By taking link faults into account in ASNoC design, we also propose an application mapping heuristic and a table-based fault-tolerant routing algorithm. Experiments are carried out for a specific link and any link fault in fault-tolerant topologies generated by our approach and approaches reported in the literature. For the experimentation, we used the multimedia applications Picture-in-Picture (PiP), Moving Picture Experts Group (MPEG)-4, MP3Encoder, and Video Object Plane Decoder (VOPD). Experiments are run on software and hardware platforms. The static performance metric communication cost and the dynamic performance metrics network latency, throughput, and router power consumption are examined using the software platform. On the hardware platform, a Field Programmable Gate Array (FPGA) is used to validate the proposed fault-tolerant topologies and analyze performance metrics such as application runtime, resource utilization, and power consumption. The results are compared with the existing approaches, specifically Ring topology and its modified versions, on both software and hardware platforms. The experimental results obtained from software and hardware platforms for a specific link and any link fault show significant improvements in performance metrics using our approach when compared with the related works in the literature.
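To illustrate the idea of table-based fault-tolerant routing over a topology with spare links, here is a minimal sketch. It is not the paper's algorithm: the function `build_routing_table`, the four-node ring, and the single spare link are hypothetical, and the tables are simply filled with shortest-path next hops found by breadth-first search after the faulty link is removed.

```python
from collections import deque

def build_routing_table(nodes, links, faulty_link=None):
    """Return table[src][dst] = next hop, skipping the faulty link if given."""
    adj = {n: [] for n in nodes}
    for a, b in links:
        if faulty_link is not None and {a, b} == set(faulty_link):
            continue  # a permanently faulty link carries no traffic
        adj[a].append(b)
        adj[b].append(a)
    table = {n: {} for n in nodes}
    for dst in nodes:
        # BFS outward from dst: the node a router was discovered from
        # is that router's next hop on a shortest path toward dst.
        seen = {dst}
        queue = deque([dst])
        while queue:
            cur = queue.popleft()
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    table[nb][dst] = cur
                    queue.append(nb)
    return table

# Hypothetical example: a 4-node ring plus one spare link between 0 and 2.
nodes = [0, 1, 2, 3]
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
spare = [(0, 2)]

healthy = build_routing_table(nodes, ring + spare)
faulted = build_routing_table(nodes, ring + spare, faulty_link=(0, 1))
no_spare = build_routing_table(nodes, ring, faulty_link=(0, 1))
```

With link 0-1 faulty, node 0 reaches node 1 through the spare link (next hop 2, two hops in total); without the spare link it has to go the long way round the ring (next hop 3, three hops).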