
53 research outputs found

On Fault Resilient Network-on-Chip for Many Core Systems

Author: Moriam Sadia
Publication date: 24/05/2019

Rapid scaling of transistor gate sizes has increased the density of on-chip integration and paved the way for heterogeneous many-core systems-on-chip, significantly improving the speed of on-chip processing. Designing the interconnection network of these complex systems is challenging, and the network-on-chip (NoC) is now the accepted scalable and bandwidth-efficient interconnect for multi-processor systems-on-chip (MPSoCs). However, the performance enhancements of technology scaling come at the cost of reliability, as on-chip components, particularly the network-on-chip, become increasingly prone to faults. In this thesis, we focus on approaches to deal with the errors caused by such faults. The results of these approaches are obtained not only via time-consuming cycle-accurate simulations but also by analytical approaches, allowing for faster yet accurate evaluations, especially for larger networks. Redundancy is the general approach to dealing with faults, and its mode varies with the type of fault. NoC faults are classified as transient, intermittent, or permanent. Transient faults appear randomly for a few cycles and may be caused by particle radiation. Intermittent faults are similar to transient faults but occur repeatedly at the same location, eventually leading to a permanent fault. Permanent faults are, by definition, caused by wires and transistors being permanently shorted or open. Generally, spatial redundancy, that is, the use of redundant components, is used for dealing with permanent faults. Temporal redundancy deals with failures by re-execution or by retransmission of data, while information redundancy adds redundant information to the data packets, allowing for error detection and correction. Temporal and information redundancy methods are useful when dealing with transient and intermittent faults.
In this dissertation, we begin with permanent faults in the NoC in the form of faulty links and routers. Our approach to spatial redundancy adds redundant links in the diagonal direction to the standard rectangular mesh topology, resulting in the hexagonal and octagonal NoCs. In addition to redundant links, adaptive routing must be used to bypass faulty components. We develop novel fault-tolerant, deadlock-free adaptive routing algorithms for these topologies based on the turn model, without the use of virtual channels. Our results show that the hexagonal and octagonal NoCs can tolerate all 2-router and 3-router faults, respectively, while the mesh has been shown to tolerate all 1-router faults. To simplify the restricted-turn selection process for achieving deadlock freedom, we devised an approach based on the channel dependency matrix instead of Duato's state-of-the-art method of inspecting the channel dependency graph for cycles. The approach is general and can be used for the turn selection process for any regular topology. We further use algebraic manipulations of the channel dependency matrix to analytically assess the fault resilience of the adaptive routing algorithms when affected by permanent faults. We present and validate this method for the 2D mesh and hexagonal NoC topologies, achieving very high accuracy with a maximum error of 1%. The approach is very general and allows for faster evaluations than the commonly used cycle-accurate simulations. In comparison, existing works usually assume a limited number of faults in order to assess the network reliability analytically. We apply the approach to evaluate the fault resilience of larger NoCs, demonstrating its usefulness, especially compared to cycle-accurate simulations. Finally, we concentrate on temporal and information redundancy techniques to deal with transient and intermittent faults in the router that result in packets being dropped and hence lost.
Temporal redundancy is applied in the form of automatic repeat request (ARQ) and retransmission of lost packets. Information redundancy is applied by generating and transmitting redundant linear combinations of packets, known as random linear network coding. We develop an analytic model for flexible evaluation of these approaches to determine network performance parameters such as residual error rates and increased network load. The analytic model allows the evaluation of larger NoCs and different topologies, and the investigation of the advantage of network coding over uncoded transmissions. We further extend the work with a brief look at the problem of secure communication over the NoC. Assuming large heterogeneous MPSoCs with components from third parties, the communication is subject to active attacks in the form of packet modification and drops in the NoC routers. Devising approaches to resolve these issues, we again formulate analytic models for their flexible and accurate evaluation, with a maximum estimation error of 7%.
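
As a rough illustration of the deadlock-freedom check mentioned in the abstract above, the sketch below tests whether a channel dependency relation, stored as a 0/1 matrix, contains a cycle. It is a generic reachability test (Warshall's transitive closure), not the thesis's own algebraic method, which the abstract does not spell out. The function name `has_cycle` and both example matrices are hypothetical.

```python
def has_cycle(dep):
    """Return True if the channel dependency matrix `dep` contains a cycle.

    dep[i][j] == 1 means a packet holding channel i may next request channel j.
    (Assumed encoding for this sketch; not taken from the thesis.)
    """
    n = len(dep)
    # Boolean transitive closure via Warshall's algorithm.
    reach = [[bool(x) for x in row] for row in dep]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    # A cycle exists exactly when some channel can reach itself.
    return any(reach[i][i] for i in range(n))


# Hypothetical 4-channel ring with every turn allowed: 0 -> 1 -> 2 -> 3 -> 0.
ring = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

# Same channels with one turn prohibited (dependency 3 -> 0 removed).
restricted = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

print(has_cycle(ring))        # cyclic dependencies: deadlock is possible
print(has_cycle(restricted))  # acyclic: this turn restriction breaks the cycle
```

Prohibiting a single turn removes one dependency from the matrix, which is the sense in which a turn-model restriction can be checked on the matrix rather than by inspecting the graph by eye.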

Technische Universität Dresden: Qucosa

SpiNNaker: Fault tolerance in a power- and area- constrained large-scale neuromimetic architecture

Author: Furber Steve
Garside Jim
Jin Xin
Khan Mukaram
Lester David
Luján Mikel
Miguel-Alonso José
Navaridas Javier
Painkras Eustace
Patterson Cameron
Plana Luis A.
Rast Alexander
Richards Dominic
Shi Yebin
Temple Steve
Wu Jian
Yang Shufan
Publication venue: The Authors. Published by Elsevier B.V.
Publication date: 01/01/2013

SpiNNaker is a biologically inspired, massively parallel computer designed to model up to a billion spiking neurons in real time. A full-fledged implementation of a SpiNNaker system will comprise more than 10^5 integrated circuits (half of which are SDRAMs and half multi-core systems-on-chip). Given this scale, it is unavoidable that some components fail and, in consequence, fault tolerance is a foundation of the system design. Although the target application can tolerate a certain low level of failures, significant effort has been devoted to incorporating different fault-tolerance techniques. This paper discusses how hardware and software mechanisms collaborate to make SpiNNaker operate properly even in the very likely scenario of component failures, and how it can tolerate system-degradation levels well above those expected.

Elsevier - Publisher Connector

Crossref

Enlighten

The University of Manchester - Institutional Repository

A SystemC platform for Network-on-Chip performance/power evaluation and comparison

Author: CONTI MASSIMO
S. GIGLI
Publication date: 01/01/2009

IRIS Università Politecnica delle Marche

Submicron Systems Architecture Project: Semiannual Technical Report

Author: Seitz Charles L.
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/1989

No abstract available

Caltech Authors

Multi-Objective Routing for Distributed Controllers

Author: Rubin Konstantin Y.
Publication venue: Scholar Commons
Publication date: 01/10/2021

A long-term goal of future naval shipboard power systems is the ability to manage energy flow with sufficient flexibility to accommodate future platform requirements such as better survivability, continuity, and support of pulsed and other demanding loads. To facilitate scalable, low-latency global distributed system control, each control module can include an integrated network interface connected through multiple channels onto a direct, multi-hop network topology. In this work, we focus on a 2D Torus, in which control nodes are arranged in a regular 2D grid, with each node connected through point-to-point connections to its four immediate neighbors. An important advantage of 2D Tori is their redundant topology, in which there is more than one minimal path between any source and destination as long as they do not share the same row or column in the grid. For the static, all-to-one traffic pattern used by a central controller, the number of minimal routing tables grows as O(N!N2). This dissertation presents a novel approach to generating routing tables that achieve two performance objectives: (1) minimal control period latency, the lower bound of which is the round-trip latency of the messages exchanged between the controller and the node having the longest route, and (2) minimal latency jitter. Our approach relies on creating a large system of integer linear algebra equations describing (i) the functionality of a network and (ii) the constraints needed for perfect load balance and low jitter. We use the Gurobi ILA solver to find a satisfying assignment of all boolean variables representing where packets are scheduled to be in a certain timeframe. Experimental results show that our software pipeline generates routing tables that (i) are guaranteed to have perfect load balance regardless of the shape and size of the network and (ii) have lower jitter than any of the randomly generated routing tables we simulated.
Our software also has an option to generate routing tables that allow packets to follow non-minimum-hop-count paths and to be held in the source nodes for some time instead of immediately rushing to the master node. This helps packets avoid congested areas and, as the results show, achieves up to a 2x improvement in jitter.
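
To make the path-redundancy remark in the abstract above concrete, the following sketch counts the minimal paths between two nodes of a 2D torus. It is a standalone combinatorial illustration, not part of the dissertation's software pipeline. The names `ring_offset` and `minimal_path_count` are hypothetical, and nodes are assumed to be addressed as `(column, row)` pairs.

```python
from math import comb


def ring_offset(a, b, size):
    """Shortest hop distance between positions a and b on a ring of `size` nodes."""
    d = abs(a - b) % size
    return min(d, size - d)


def minimal_path_count(src, dst, width, height):
    """Number of minimal-hop paths from src to dst on a width x height torus.

    src and dst are (column, row) pairs (assumed addressing for this sketch).
    """
    dx = ring_offset(src[0], dst[0], width)
    dy = ring_offset(src[1], dst[1], height)
    # Interleavings of dx horizontal hops and dy vertical hops.
    count = comb(dx + dy, dx)
    # If an offset is exactly half the ring, both wrap directions are minimal.
    if dx > 0 and 2 * dx == width:
        count *= 2
    if dy > 0 and 2 * dy == height:
        count *= 2
    return count


# 5 x 5 torus: nodes in different rows and columns have several minimal paths,
# while nodes sharing a row have exactly one.
print(minimal_path_count((0, 0), (2, 1), 5, 5))
print(minimal_path_count((0, 0), (4, 0), 5, 5))
```

On tori with an even side length, two nodes in the same row or column that are exactly half a ring apart also have two minimal paths, one per wrap direction, which the sketch handles as a special case.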

Scholar Commons - Institutional Repository of the University of South Carolina

Submicron Systems Architecture Project: Semiannual Technical Report

Author: Seitz Charles L.
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/1987

No abstract available

Caltech Authors

Recommended from our members

Interconnection Networks Based on Gaussian and Eisenstein-Jacobi Integers

Author: Shamaei Arash
Publication venue: 'Oregon State University'

Quotient rings of Gaussian and Eisenstein-Jacobi (EJ) integers can be deployed to construct interconnection networks with good topological properties. In this thesis, we propose deadlock-free deterministic and partially adaptive routing algorithms for hexagonal networks, a special class of EJ networks. Then we discuss higher dimensional Gaussian networks as an alternative to classical multidimensional toroidal networks. For this topology, we explore many properties, including distance distribution and the decomposition of higher dimensional Gaussian networks into Hamiltonian cycles. In addition, we propose some efficient communication algorithms for higher dimensional Gaussian networks, including one-to-all broadcasting and shortest path routing. Simulation results show that the routing algorithm proposed for higher dimensional Gaussian networks outperforms the routing algorithm of the corresponding torus networks with approximately the same number of nodes. These simulation results are expected, since higher dimensional Gaussian networks have a smaller diameter and a smaller average message latency than toroidal networks. Finally, we introduce a degree-three interconnection network obtained by pruning a Gaussian network. This network shows possible performance improvement over other degree-three networks since it has a smaller diameter. Many topological properties of the degree-three pruned Gaussian network are explored. In addition, an optimal shortest path routing algorithm and a one-to-all broadcasting algorithm are given.

ScholarsArchive@OSU

Fault diagnosis of distributed systems : analysis, simulation and performance measurement.

Author: Mohammed Thabit Sultan
Publication date: 01/01/1992

Fault diagnosis forms an essential component in the design of highly reliable distributed computing systems. Early models for diagnosis require a global observer, whereas in later models the diagnosis is shared between the system's nodes. These models are reviewed and their different diagnosability properties reconciled. The design of improved fault diagnosis algorithms for systems without a global observer provides the main motivation for the thesis. The modified algorithm SELF3 [Hoss88] is taken as a starting point. A number of communication architectures used in distributed systems are reviewed. The properties of diagnosis algorithms depend strongly on the testing graph. A general class of testing graphs, designated as H-graphs (a generalization of the Dδt graphs introduced in [Prep67]), is investigated and its diagnostic properties determined. A software simulator for distributed systems has been written as the main investigative tool for diagnosis algorithms. The design and structure of the simulator are described. The diagnosis process is measured in terms of diagnostic time and number of messages produced, and the factors upon which these quantities depend are identified. The results of simulating a number of systems under various fault conditions are given. A modified way of routing diagnosis messages is presented which, especially in large systems, reduces both the number of diagnosis messages and the time required to perform diagnosis. The thesis also contains a number of specific recommendations for improving existing self-diagnosis algorithms.

Cranfield CERES

Globally asynchronous locally synchronous configurable array architecture for algorithm embeddings

Author: Gao Bo
Publication venue: The University of Edinburgh
Publication date: 01/01/1996

Edinburgh Research Archive