43 research outputs found

    TFI-FTS: An efficient transient fault injection and fault-tolerant system for asynchronous circuits on FPGA platform

    Get PDF
    Designing VLSI digital circuits is challenging tasks because of testing the circuits concerning design time. The reliability and productivity of digital integrated circuits are primarily affected by the defects in the manufacturing process or systems. If the defects are more in the systems, which leads the fault in the systems. The fault tolerant systems are necessary to overcome the faults in the VLSI digital circuits. In this research article, an asynchronous circuits based an effective transient fault injection (TFI) and fault tolerant system (FTS) are modelled. The TFI system generates the faults based on BMA based LFSR with faulty logic insertion and one hot encoded register. The BMA based LFSR reduces the hardware complexity with less power consumption on-chip than standard LFSR method. The FTS uses triple mode redundancy (TMR) based majority voter logic (MVL) to tolerant the faults for asynchronous circuits. The benchmarked 74X-series circuits are considered as an asynchronous circuit for TMR logic. The TFI-FTS module is modeled using Verilog-HDL on Xilinx-ISE and synthesized on hardware platform. The Performance parameters are tabulated for TFI-FTS based asynchronous circuits. The performance of TFI-FTS Module is analyzed with 100% fault coverage. The fault coverage is validated using functional simulation of each asynchronous circuit with fault injection in TFI-FTS Module

    Savior: A Reliable Fault Resilient Router Architecture for Network-on-Chip

    Full text link
    [EN] The router plays an important role in communication among different processing cores in on-chip networks. Technology scaling on one hand has enabled the designers to integrate multiple processing components on a single chip; on the other hand, it becomes the reason for faults. A generic router consists of the buffers and pipeline stages. A single fault may result in an undesirable situation of degraded performance or a whole chip may stop working. Therefore, it is necessary to provide permanent fault tolerance to all the components of the router. In this paper, we propose a mechanism that can tolerate permanent faults that occur in the router. We exploit the fault-tolerant techniques of resource sharing and paring between components for the input port unit and routing computation (RC) unit, the resource borrowing for virtual channel allocator (VA) and multiple paths for switch allocator (SA) and crossbar (XB). The experimental results and analysis show that the proposed mechanism enhances the reliability of the router architecture towards permanent faults at the cost of 29% area overhead. The proposed router architecture achieves the highest Silicon Protection Factor (SPF) metric, which is 24.4 as compared to the state-of-the-art fault-tolerant architectures. It incurs an increase in latency for SPLASH2 and PARSEC benchmark traffics, which is minimal as compared to the baseline router.This work was supported by the Spanish 'Ministerio de Ciencia Innovacion y Universidades' and FEDER program in the framework of the 'Proyectos de I+D d Generacion de Conocimiento del Programa Estatal de Generacion de Conocimiento y Fortalecimiento Cientifico y Tecnologico del Sistema de I+D+i, Subprograma Estatal de Generacion de Conocimiento' (ref: PGC2018-095747-B-I00).Hussain, A.; Irfan, M.; Baloch, NK.; Draz, U.; Ali, T.; Glowacz, A.; Dunai, L.... (2020). Savior: A Reliable Fault Resilient Router Architecture for Network-on-Chip. Electronics. 9(11):1-18. https://doi.org/10.3390/electronics9111783S118911Borkar, S. (1999). Design challenges of technology scaling. IEEE Micro, 19(4), 23-29. doi:10.1109/40.782564Latif, K., Rahmani, A.-M., Nigussie, E., Seceleanu, T., Radetzki, M., & Tenhunen, H. (2013). Partial Virtual Channel Sharing: A Generic Methodology to Enhance Resource Management and Fault Tolerance in Networks-on-Chip. Journal of Electronic Testing, 29(3), 431-452. doi:10.1007/s10836-013-5389-5Borkar, S. (2005). Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro, 25(6), 10-16. doi:10.1109/mm.2005.110Ali, T., Noureen, J., Draz, U., Shaf, A., Yasin, S., & Ayaz, M. (2018). Participants Ranking Algorithm for Crowdsensing in Mobile Communication. ICST Transactions on Scalable Information Systems, 5(16), 154476. doi:10.4108/eai.13-4-2018.154476Ali, T., Draz, U., Yasin, S., Noureen, J., shaf, A., & Zardari, M. (2018). An Efficient Participant’s Selection Algorithm for Crowdsensing. International Journal of Advanced Computer Science and Applications, 9(1). doi:10.14569/ijacsa.2018.090154Poluri, P., & Louri, A. (2016). Shield: A Reliable Network-on-Chip Router Architecture for Chip Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 27(10), 3058-3070. doi:10.1109/tpds.2016.2521641Valinataj, M., & Shahiri, M. (2016). A low-cost, fault-tolerant and high-performance router architecture for on-chip networks. Microprocessors and Microsystems, 45, 151-163. doi:10.1016/j.micpro.2016.04.009Kim, J., Nicopoulos, C., Park, D., Narayanan, V., Yousif, M. S., & Das, C. R. (2006). A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks. ACM SIGARCH Computer Architecture News, 34(2), 4-15. doi:10.1145/1150019.1136487Polian, I., & Hayes, J. P. (2011). Selective Hardening: Toward Cost-Effective Error Tolerance. IEEE Design & Test of Computers, 28(3), 54-63. doi:10.1109/mdt.2010.120Mohammed, H., Flayyih, W., & Rokhani, F. (2019). Tolerating Permanent Faults in the Input Port of the Network on Chip Router. Journal of Low Power Electronics and Applications, 9(1), 11. doi:10.3390/jlpea9010011Wang, L., Ma, S., Li, C., Chen, W., & Wang, Z. (2017). A high performance reliable NoC router. Integration, 58, 583-592. doi:10.1016/j.vlsi.2016.10.016Shafique, M. A., Baloch, N. K., Baig, M. I., Hussain, F., Zikria, Y. B., & Kim, S. W. (2020). NoCGuard: A Reliable Network-on-Chip Router Architecture. Electronics, 9(2), 342. doi:10.3390/electronics9020342Poluri, P., & Louri, A. (2015). A Soft Error Tolerant Network-on-Chip Router Pipeline for Multi-Core Systems. IEEE Computer Architecture Letters, 14(2), 107-110. doi:10.1109/lca.2014.2360686Feng, C., Lu, Z., Jantsch, A., Zhang, M., & Xing, Z. (2013). Addressing Transient and Permanent Faults in NoC With Efficient Fault-Tolerant Deflection Router. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(6), 1053-1066. doi:10.1109/tvlsi.2012.2204909Liu, J., Harkin, J., Li, Y., & Maguire, L. P. (2016). Fault-Tolerant Networks-on-Chip Routing With Coarse and Fine-Grained Look-Ahead. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(2), 260-273. doi:10.1109/tcad.2015.2459050Runge, A. (2015). FaFNoC: A Fault-tolerant and Bufferless Network-on-chip. Procedia Computer Science, 56, 397-402. doi:10.1016/j.procs.2015.07.226Binkert, N., Beckmann, B., Black, G., Reinhardt, S. K., Saidi, A., Basu, A., … Wood, D. A. (2011). The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2), 1-7. doi:10.1145/2024716.202471

    Fault-tolerant networks-on-chip routing with coarse and fine-grained look-ahead

    Get PDF
    Fault tolerance and adaptive capabilities are challenges for modern networks-on-chip (NoC) due to the increase in physical defects in advanced manufacturing processes. Two novel adaptive routing algorithms, namely coarse and fine-grained (FG) look-ahead algorithms, are proposed in this paper to enhance 2-D mesh/torus NoC system fault-tolerant capabilities. These strategies use fault flag codes from neighboring nodes to obtain the status or conditions of real-time traffic in an NoC region, then calculate the path weights and choose the route to forward packets. This approach enables the router to minimize congestion for the adjacent connected channels and also to bypass a path with faulty channels by looking ahead at distant neighboring router paths. The novelty of the proposed routing algorithms is the weighted path selection strategies, which make near-optimal routing decisions to maintain the NoC system performance under high fault rates. Results show that the proposed routing algorithms can achieve performance improvement compared to other state of the art works under various traffic loads and high fault rates. The routing algorithm with FG look-ahead capability achieves a higher throughput compared with the coarse-grained approach under complex fault patterns. The hardware area/power overheads of both routing approaches are relatively low which does not prohibit scalability for large-scale NoC implementations

    On Fault Tolerance Methods for Networks-on-Chip

    Get PDF
    Technology scaling has proceeded into dimensions in which the reliability of manufactured devices is becoming endangered. The reliability decrease is a consequence of physical limitations, relative increase of variations, and decreasing noise margins, among others. A promising solution for bringing the reliability of circuits back to a desired level is the use of design methods which introduce tolerance against possible faults in an integrated circuit. This thesis studies and presents fault tolerance methods for network-onchip (NoC) which is a design paradigm targeted for very large systems-onchip. In a NoC resources, such as processors and memories, are connected to a communication network; comparable to the Internet. Fault tolerance in such a system can be achieved at many abstraction levels. The thesis studies the origin of faults in modern technologies and explains the classification to transient, intermittent and permanent faults. A survey of fault tolerance methods is presented to demonstrate the diversity of available methods. Networks-on-chip are approached by exploring their main design choices: the selection of a topology, routing protocol, and flow control method. Fault tolerance methods for NoCs are studied at different layers of the OSI reference model. The data link layer provides a reliable communication link over a physical channel. Error control coding is an efficient fault tolerance method especially against transient faults at this abstraction level. Error control coding methods suitable for on-chip communication are studied and their implementations presented. Error control coding loses its effectiveness in the presence of intermittent and permanent faults. Therefore, other solutions against them are presented. The introduction of spare wires and split transmissions are shown to provide good tolerance against intermittent and permanent errors and their combination to error control coding is illustrated. At the network layer positioned above the data link layer, fault tolerance can be achieved with the design of fault tolerant network topologies and routing algorithms. Both of these approaches are presented in the thesis together with realizations in the both categories. The thesis concludes that an optimal fault tolerance solution contains carefully co-designed elements from different abstraction levelsSiirretty Doriast

    Run-time management of many-core SoCs: A communication-centric approach

    Get PDF
    The single core performance hit the power and complexity limits in the beginning of this century, moving the industry towards the design of multi- and many-core system-on-chips (SoCs). The on-chip communication between the cores plays a criticalrole in the performance of these SoCs, with power dissipation, communication latency, scalability to many cores, and reliability against the transistor failures as the main design challenges. Accordingly, we dedicate this thesis to the communicationcentered management of the many-core SoCs, with the goal to advance the state-ofthe-art in addressing these challenges. To this end, we contribute to on-chip communication of many-core SoCs in three main directions. First, we start with a synthesizable SoC with full system simulation. We demonstrate the importance of the networking overhead in a practical system, and propose our sophisticated network interface (NI) that offloads the work from SW to HW. Our results show around 5x and up to 50x higher network performance, compared to previous works. As the second direction of this thesis, we study the significance of run-time application mapping. We demonstrate that contiguous application mapping not only improves the network latency (by 23%) and power dissipation (by 50%), but also improves the system throughput (by 3%) and quality-of-service (QoS) of soft real-time applications (up to 100x less deadline misses). Also our hierarchical run-time application mapping provides 99.41% successful mapping when up to 8 links are broken. As the final direction of the thesis, we propose a fault-tolerant routing algorithm, the maze-routing. It is the first-in-class algorithm that provides guaranteed delivery, a fully-distributed solution, low area overhead (by 16x), and instantaneous reconfiguration (vs. 40K cycles down time of previous works), all at the same time. Besides the individual goals of each contribution, when applicable, we ensure that our solutions scale to extreme network sizes like 12x12 and 16x16. This thesis concludes that the communication overhead and its optimization play a significant role in the performance of many-core SoC

    Resilient Routing Implementation in 2D Mesh NoC

    No full text
    With the rapid shrinking of technology and growing integration capacity, the probability of failures in Networks-on-Chip (NoCs) increases and thus, fault tolerance is essential. Moreover, the unpredictable locations of these failures may influence the regularity of the underlying topology, and a regular 2D mesh is likely to become irregular. Thus, for these failure-prone networks, a viable routing framework should comprise a topology-agnostic routing algorithm along with a cost-effective, scalable routing mechanism able to handle failures, irrespective of any particular failure patterns. Existing routing techniques designed to route irregular topologies efficiently lack flexibility (logic-based), scalability (table-based) or relaxed switch design (uLBDR-based). Designing an efficient routing implementation technique to address irregular topologies remains a pressing research problem. To address this, we present a fault resilient routing mechanism for irregular 2D meshes resulting from failures. To handle irregularities, it avoids using routing tables and employs a few fixed configuration bits per switch resulting in a scalable approach. Experiments demonstrate that the proposed approach is guaranteed to tolerate all locations of single and double-link failures and most multiple failures. Also, unlike uLBDR it is not restricted to any particular switching technique and does not replicate any extra messages. Along with fault tolerance, the proposed mechanism can achieve better network performance in fault-free cases. The proposed technique achieves graceful performance degradation during failure. Compared to uLBDR, our method has 14% less area requirements and 16% less overall power consumption

    On Fault Resilient Network-on-Chip for Many Core Systems

    Get PDF
    Rapid scaling of transistor gate sizes has increased the density of on-chip integration and paved the way for heterogeneous many-core systems-on-chip, significantly improving the speed of on-chip processing. The design of the interconnection network of these complex systems is a challenging one and the network-on-chip (NoC) is now the accepted scalable and bandwidth efficient interconnect for multi-processor systems on-chip (MPSoCs). However, the performance enhancements of technology scaling come at the cost of reliability as on-chip components particularly the network-on-chip become increasingly prone to faults. In this thesis, we focus on approaches to deal with the errors caused by such faults. The results of these approaches are obtained not only via time-consuming cycle-accurate simulations but also by analytical approaches, allowing for faster and accurate evaluations, especially for larger networks. Redundancy is the general approach to deal with faults, the mode of which varies according to the type of fault. For the NoC, there exists a classification of faults into transient, intermittent and permanent faults. Transient faults appear randomly for a few cycles and may be caused by the radiation of particles. Intermittent faults are similar to transient faults, however, differing in the fact that they occur repeatedly at the same location, eventually leading to a permanent fault. Permanent faults by definition are caused by wires and transistors being permanently short or open. Generally, spatial redundancy or the use of redundant components is used for dealing with permanent faults. Temporal redundancy deals with failures by re-execution or by retransmission of data while information redundancy adds redundant information to the data packets allowing for error detection and correction. Temporal and information redundancy methods are useful when dealing with transient and intermittent faults. In this dissertation, we begin with permanent faults in NoC in the form of faulty links and routers. Our approach for spatial redundancy adds redundant links in the diagonal direction to the standard rectangular mesh topology resulting in the hexagonal and octagonal NoCs. In addition to redundant links, adaptive routing must be used to bypass faulty components. We develop novel fault-tolerant deadlock-free adaptive routing algorithms for these topologies based on the turn model without the use of virtual channels. Our results show that the hexagonal and octagonal NoCs can tolerate all 2-router and 3-router faults, respectively, while the mesh has been shown to tolerate all 1-router faults. To simplify the restricted-turn selection process for achieving deadlock freedom, we devised an approach based on the channel dependency matrix instead of the state-of-the-art Duato's method of observing the channel dependency graph for cycles. The approach is general and can be used for the turn selection process for any regular topology. We further use algebraic manipulations of the channel dependency matrix to analytically assess the fault resilience of the adaptive routing algorithms when affected by permanent faults. We present and validate this method for the 2D mesh and hexagonal NoC topologies achieving very high accuracy with a maximum error of 1%. The approach is very general and allows for faster evaluations as compared to the generally used cycle-accurate simulations. In comparison, existing works usually assume a limited number of faults to be able to analytically assess the network reliability. We apply the approach to evaluate the fault resilience of larger NoCs demonstrating the usefulness of the approach especially compared to cycle-accurate simulations. Finally, we concentrate on temporal and information redundancy techniques to deal with transient and intermittent faults in the router resulting in the dropping and hence loss of packets. Temporal redundancy is applied in the form of ARQ and retransmission of lost packets. Information redundancy is applied by the generation and transmission of redundant linear combinations of packets known as random linear network coding. We develop an analytic model for flexible evaluation of these approaches to determine the network performance parameters such as residual error rates and increased network load. The analytic model allows to evaluate larger NoCs and different topologies and to investigate the advantage of network coding compared to uncoded transmissions. We further extend the work with a small insight to the problem of secure communication over the NoC. Assuming large heterogeneous MPSoCs with components from third parties, the communication is subject to active attacks in the form of packet modification and drops in the NoC routers. Devising approaches to resolve these issues, we again formulate analytic models for their flexible and accurate evaluations, with a maximum estimation error of 7%

    Global Congestion and Fault Aware Wireless Interconnection Framework for Multicore Systems

    Get PDF
    Multicore processors are getting more common in the implementation of all type of computing demands, starting from personal computers to the large server farms for high computational demanding applications. The network-on-chip provides a better alternative to the traditional bus based communication infrastructure for this multicore system. Conventional wire-based NoC interconnect faces constraints due to their long multi-hop latency and high power consumption. Furthermore high traffic generating applications sometimes creates congestion in such system further degrading the systems performance. In this thesis work, a novel two-state congestion aware wireless interconnection framework for network chip is presented. This WiNoC system was designed to able to dynamically redirect traffic to avoid congestion based on network condition information shared among all the core tiles in the system. Hence a novel routing scheme and a two-state MAC protocol is proposed based on a proposed two layer hybrid mesh-based NoC architecture. The underlying mesh network is connected via wired-based interconnect and on top of that a shared wireless interconnect framework is added for single-hop communication. The routing scheme is non-deterministic in nature and utilizes the principles from existing dynamic routing algorithms. The MAC protocol for the wireless interface works in two modes. The first is data mode where a token-based protocol is utilized to transfer core data. And the second mode is the control mode where a broadcast-based communication protocol is used to share the network congestion information. The work details the switching methodology between these two modes and also explain, how the routing scheme utilizes the congestion information (gathered during the control mode) to route data packets during normal operation mode. The proposed work was modeled in a cycle accurate network simulator and its performance were evaluated against traditional NoC and WiNoC designs

    Bridging the Gap between Resilient Networks-on-Chip and Real-Time Systems

    Get PDF
    Conventional fault-tolerance approaches for Networks-on-Chip (NoCs) cannot be applied to high dependability systems due to their different goals and constraints. These systems impose strict integrity, resilience and real-time requirements. In order to meet these requirements, all possible effects of random hardware errors must be taken into account, silent data corruption must be prevented and the resulting system must be predictable in the presence of errors. In this paper, we present a wormhole-switched NoC with virtual channels for high dependability systems hardened against soft errors. The NoC is developed based on results of a Failure Mode and Effects Analysis. It efficiently handles errors in different network layers and operates with formal guarantees. Our experimental evaluation, including an industrial avionics use case, shows that the network is able to achieve predictable behavior even in aggressive environments with very high error rates while presenting competitive overheads
    corecore