Search CORE

39 research outputs found

Fault tolerant clos network

Author: Ahluwalia Preet Mohan S.
Publication venue: Digital Commons @ NJIT
Publication date: 31/01/1991
Field of study

Multistage interconnection networks, or MINs, provide paths between functional modules in multiprocessor systems. The MINs are usually segmented into several stages. Each stage connects inputs to appropriate links of the next stage so that the cumulative effect of all the stages satisfies input-output connection requirements. This thesis deals with a fault tolerant Clos network. The fault tolerance technique involves addition of extra switches per stage to compensate for any switch failure The reliability analysis of both ordinary and fault tolerant Clos networks is presented. The optimal number of extra switches required to get the best reliability results has been analyzed

Digital Commons @ New Jersey Institute of Technology (NJIT)

CLEX: Yet Another Supercomputer Architecture?

Author: Lenzen Christoph
Wattenhofer Roger
Publication venue
Publication date: 01/01/2016
Field of study

We propose the CLEX supercomputer topology and routing scheme. We prove that CLEX can utilize a constant fraction of the total bandwidth for point-to-point communication, at delays proportional to the sum of the number of intermediate hops and the maximum physical distance between any two nodes. Moreover, % applying an asymmetric bandwidth assignment to the links, all-to-all communication can be realized

(1+o(1))

-optimally both with regard to bandwidth and delays. This is achieved at node degrees of

n^{\varepsilon}

, for an arbitrary small constant

\varepsilon\in (0,1]

. In contrast, these results are impossible in any network featuring constant or polylogarithmic node degrees. Through simulation, we assess the benefits of an implementation of the proposed communication strategy. Our results indicate that, for a million processors, CLEX can increase bandwidth utilization and reduce average routing path length by at least factors

10

respectively

5

in comparison to a torus network. Furthermore, the CLEX communication scheme features several other properties, such as deadlock-freedom, inherent fault-tolerance, and canonical partition into smaller subsystems

arXiv.org e-Print Archive

MPG.PuRe

Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures

Author: Jens Domke
Satoshi Matsuoka
Torsten Hoefler
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/12/2014
Field of study

Abstract—The growing system size of high performance com-puters results in a steady decrease of the mean time between failures. Exchanging network components often requires whole system downtime which increases the cost of failures. In this work, we study a fail-in-place strategy where broken network elements remain untouched. We show, that a fail-in-place strategy is feasible for todays networks and the degradation is manageable, and provide guidelines for the design. Our network failure simulation toolchain allows system designers to extrapolate the performance degradation based on expected failure rates, and it can be used to evaluate the current state of a system. In a case study of real-world HPC systems, we will analyze the performance degradation throughout the systems lifetime under the assumption that faulty network components are not repaired, which results in a recommendation to change the used routing algorithm to improve the network performance as well as the fail-in-place characteristic. Keywords—Network design, network simulations, network man-agement, fail-in-place, routing protocols, fault tolerance, availability I

CiteSeerX

Crossref

Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

Author: Domke Jens
Publication venue
Publication date: 16/06/2017
Field of study

In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

Technische Universität Dresden: Qucosa

Characterisation and mitigation of long-term degradation effects in programmable logic

Author: Stott Edward A.
Publication venue: Electrical and Electronic Engineering, Imperial College London
Publication date: 01/02/2012
Field of study

Reliability has always been an issue in silicon device engineering, but until now it has been managed by the carefully tuned fabrication process. In the future the underlying physical limitations of silicon-based electronics, plus the practical challenges of manufacturing with such complexity at such a small scale, will lead to a crunch point where transistor-level reliability must be forfeited to continue achieving better productivity. Field-programmable gate arrays (FPGAs) are built on state-of-the-art silicon processes, but it has been recognised for some time that their distinctive characteristics put them in a favourable position over application-specific integrated circuits in the face of the reliability challenge. The literature shows how a regular structure, interchangeable resources and an ability to reconfigure can all be exploited to detect, locate, and overcome degradation and keep an FPGA application running. To fully exploit these characteristics, a better understanding is needed of the behavioural changes that are seen in the resources that make up an FPGA under ageing. Modelling is an attractive approach to this and in this thesis the causes and effects are explored of three important degradation mechanisms. All are shown to have an adverse affect on FPGA operation, but their characteristics show novel opportunities for ageing mitigation. Any modelling exercise is built on assumptions and so an empirical method is developed for investigating ageing on hardware with an accelerated-life test. Here, experiments show that timing degradation due to negative-bias temperature instability is the dominant process in the technology considered. Building on simulated and experimental results, this work also demonstrates a variety of methods for increasing the lifetime of FPGA lookup tables. The pre-emptive measure of wear-levelling is investigated in particular detail, and it is shown by experiment how di fferent reconfiguration algorithms can result in a significant reduction to the rate of degradation

Spiral - Imperial College Digital Repository

Application of Asynchronous Transfer Mode (Atm) technology to Picture Archiving and Communication Systems (Pacs): A survey

Author: Madabhushanam Sridharan R
Publication venue: Digital Scholarship@UNLV
Publication date: 01/01/1996
Field of study

Broadband Integrated Services Digital Network (R-ISDN) provides a range of narrowband and broad-band services for voice, video, and multimedia. Asynchronous Transfer Mode (ATM) has been selected by the standards bodies as the transfer mode for implementing B-ISDN; The ability to digitize images has lead to the prospect of reducing the physical space requirements, material costs, and manual labor of traditional film handling tasks in hospitals. The system which handles the acquisition, storage, and transmission of medical images is called a Picture Archiving and Communication System (PACS). The transmission system will directly impact the speed of image transfer. Today the most common transmission means used by acquisition and display station products is Ethernet. However, when considering network media, it is important to consider what the long term needs will be. Although ATM is a new standard, it is showing signs of becoming the next logical step to meet the needs of high speed networks; This thesis is a survey on ATM, and PACS. All the concepts involved in developing a PACS are presented in an orderly manner. It presents the recent developments in ATM, its applicability to PACS and the issues to be resolved for realising an ATM-based complete PACS. This work will be useful in providing the latest information, for any future research on ATM-based networks, and PACS

University of Nevada, Las Vegas Repository

Esprit '91. Proceedings of the annual Esprit conference. Brussels, 25-29 November 1991. EUR 13853 EN

Author
Publication venue
Publication date
Field of study

Fault tolerant programmable digital attitude control electronics study

Author: Sorensen A. A.
Publication venue
Publication date
Field of study

The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation

NASA Technical Reports Server

Ontology engineering and routing in distributed knowledge management applications

Author: Tempich Christoph
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2006
Field of study

KITopen

Energy Management

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

Forecasts point to a huge increase in energy demand over the next 25 years, with a direct and immediate impact on the exhaustion of fossil fuels, the increase in pollution levels and the global warming that will have significant consequences for all sectors of society. Irrespective of the likelihood of these predictions or what researchers in different scientific disciplines may believe or publicly say about how critical the energy situation may be on a world level, it is without doubt one of the great debates that has stirred up public interest in modern times. We should probably already be thinking about the design of a worldwide strategic plan for energy management across the planet. It would include measures to raise awareness, educate the different actors involved, develop policies, provide resources, prioritise actions and establish contingency plans. This process is complex and depends on political, social, economic and technological factors that are hard to take into account simultaneously. Then, before such a plan is formulated, studies such as those described in this book can serve to illustrate what Information and Communication Technologies have to offer in this sphere and, with luck, to create a reference to encourage investigators in the pursuit of new and better solutions

Directory of Open Access Books (DOAB)