261 research outputs found

    Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which are either irregular by design or, most often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system.
This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions
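To make the channel dependency graph concrete, the following minimal sketch builds one from a set of fixed routes and tests it for cycles. It illustrates the classic acyclicity condition for deadlock freedom with a single virtual channel, not the thesis's own routing algorithm. The function names, the route representation (lists of node names), and the example topology are all assumptions made for illustration.

```python
from collections import defaultdict


def channel_dependency_graph(routes):
    """Build a channel dependency graph (CDG) from fixed routes.

    Each route is a list of node names (hypothetical input format).
    A channel is a directed link (u, v). The CDG has an edge from
    channel c1 to channel c2 whenever some route uses c2 directly after c1.
    """
    cdg = defaultdict(set)
    for path in routes:
        channels = list(zip(path, path[1:]))
        for channel in channels:
            cdg[channel]  # make sure every used channel is a vertex
        for held, requested in zip(channels, channels[1:]):
            cdg[held].add(requested)
    return dict(cdg)


def has_cycle(graph):
    """Return True if the directed graph contains a cycle (iterative DFS)."""
    WHITE, GREY, BLACK = 0, 1, 2
    vertices = set(graph)
    for successors in graph.values():
        vertices.update(successors)
    colour = {v: WHITE for v in vertices}

    for start in vertices:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            vertex, successors = stack[-1]
            for nxt in successors:
                if colour[nxt] == GREY:
                    return True  # back edge: cyclic channel dependency
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                colour[vertex] = BLACK  # all successors explored
                stack.pop()
    return False


# Hypothetical 4-node ring with clockwise two-hop routes: cyclic CDG.
ring_routes = [["A", "B", "C"], ["B", "C", "D"], ["C", "D", "A"], ["D", "A", "B"]]
# Two routes along a line: acyclic CDG.
line_routes = [["A", "B", "C"], ["B", "C", "D"]]
```

Here `has_cycle(channel_dependency_graph(ring_routes))` is true, so these routes could deadlock on one virtual channel, while the `line_routes` example yields an acyclic graph. A routing scheme that resolves such cycles after the fact typically moves some routes to additional virtual channels, which is the resource the abstract describes as limited.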

    Scheduled virtual topology design under periodic traffic in transparent optical networks

    This paper investigates offline planning and scheduling in transparent optical networks for a given periodic traffic demand. The main objective is to minimize the number of transceivers needed, which account for the main network cost. We call this problem "Scheduled Virtual Topology Design" and consider two variants: non-reconfigurable and reconfigurable equipment. We formulate both problems as exact MILPs (Mixed Integer Linear Programs). Due to their high complexity, we propose a more scalable tabu search heuristic approach, in conjunction with smaller MILP formulations for the associated subproblems. The main motivation of our research efforts is to assess the benefits of using reconfigurable equipment, realized as a reduction in the number of required transceivers. Our results show that the achieved reductions are not very significant, except for cases with large network loads and high traffic variability. The work described in this paper was carried out with the support of the BONE-project ("Building the Future Optical Network in Europe"), a Network of Excellence funded by the European Commission through the 7th ICT Framework Programme, with the support of the MEC Spanish project TEC2007-67966-01/TCM CONPARTE-1, and was developed in the framework of the "Programa de Ayudas a Grupos de Excelencia de la Región de Murcia", de la Fundación Séneca (Plan Regional de Ciencia y Tecnología 2007/2010).

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Distributed-memory systems are key to achieving high-performance computing and are the architectures most favored for advanced research problems. Mesh-connected multicomputers are among the most popular architectures and have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique, in which each packet is divided into small flits, has been widely used in the design of distributed-memory systems. Multicast communication, in which one source node sends the same message to several destination nodes, has also been widely used in distributed-memory systems. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Development of fault-tolerant multicast routing algorithms in 2D mesh networks is an important issue. This dissertation presents new fault-tolerant multicast routing algorithms to enhance the performance of distributed-memory systems that use wormhole-routed 2D meshes. These algorithms are described for fault-tolerant routing in 2D mesh networks, but they can also be extended to other topologies. They combine a unicast-based multicast algorithm with tree-based multicast algorithms. They work effectively for the most commonly encountered faults in mesh networks: f-rings, f-chains, and concave fault regions. It is shown that the proposed routing algorithms are effective even in the presence of a large number of fault regions and of large fault regions. These algorithms are proved to be deadlock-free. The problem of overlapping fault regions is also solved. Four essential performance metrics in mesh networks are considered and calculated. The algorithms use limited-global-information-based multicasting, which is a compromise between the local-information-based approach and the global-information-based approach. Data mining is used to validate the results and to enlarge the sample.
The proposed new multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms

    Topology design for time-varying networks

    Traditional wireless networks seek to support end-to-end communication through either a single-hop wireless link to infrastructure or a multi-hop wireless path to some destination. However, in some wireless networks (such as delay tolerant networks, or mobile social networks), due to sparse node distribution, node mobility, and time-varying network topology, end-to-end paths between the source and destination are not always available. In such networks, the lack of continuous connectivity, network partitioning, and long delays make the design of network protocols very challenging. Previous DTN or time-varying network research mainly focuses on routing and information propagation. However, with a large number of participating wireless devices, and with much network functionality depending on the topology, maintaining an efficient and dynamic topology of a time-varying network becomes crucial. In this dissertation, I model a time-evolving network as a directed time-space graph which includes both spatial and temporal information of the network, then I study various topology control problems with such time-space graphs. First, I study the basic topology design problem where the links of the network are reliable. It aims to build a sparse structure from the original time-space graph such that (1) the network is still connected over time and/or supports efficient routing between any two nodes; (2) the total cost of the structure is minimized. I first prove that this problem is NP-hard, and then propose several greedy-based methods as solutions. Second, I further study a cost-efficient topology design problem, which not only requires the above two objectives, but also guarantees that the spanning ratio of the topology is bounded by a given threshold. This problem is also NP-hard, and I give several greedy algorithms to solve it. Last, I consider a new topology design problem by relaxing the assumption of reliable links.
Notice that in wireless networks the topologies are not quite predictable and the links are often unreliable. In this new model, each link has a probability that reflects its reliability. The new reliable topology design problem aims to build a sparse structure from the original space-time graph such that (1) for any pair of devices, there is a space-time path connecting them with the reliability larger than a required threshold; (2) the total cost of the structure is minimized. Several heuristics are proposed, which can significantly reduce the total cost of the topology while maintaining the connectivity or reliability over time. Extensive simulations on both random networks and real-life tracing data have been conducted, and results demonstrate the efficiency of the proposed methods
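The reliability condition (1) above can be checked with a shortest-path style search. The sketch below is not one of the dissertation's heuristics. It only assumes that vertices are hypothetical `(device, time_slot)` pairs, that each directed edge carries an independent success probability, and that a path's reliability is the product of its edge probabilities. Because every probability is at most 1, a Dijkstra-style search that always expands the most reliable vertex first finds the maximum-reliability path.

```python
import heapq


def most_reliable_path(edges, source, target):
    """Return the highest path reliability from source to target.

    edges: iterable of (u, v, p) with 0 < p <= 1, where u and v are
    space-time vertices such as ("a", 0). Returns 0.0 if target is
    unreachable. All names here are illustrative assumptions.
    """
    graph = {}
    for u, v, p in edges:
        graph.setdefault(u, []).append((v, p))

    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap on reliability via negation
    while heap:
        neg_rel, u = heapq.heappop(heap)
        rel = -neg_rel
        if u == target:
            return rel
        if rel < best.get(u, 0.0):
            continue  # stale heap entry
        for v, p in graph.get(u, []):
            candidate = rel * p
            if candidate > best.get(v, 0.0):
                best[v] = candidate
                heapq.heappush(heap, (-candidate, v))
    return 0.0


# Hypothetical space-time graph over three devices and time slots 0..2.
example_edges = [
    (("a", 0), ("b", 1), 0.9),   # a transmits to b during slot 0
    (("b", 1), ("c", 2), 0.8),   # b forwards to c during slot 1
    (("a", 0), ("a", 1), 1.0),   # a keeps the message (waiting edge)
    (("a", 1), ("c", 2), 0.6),   # a transmits directly to c during slot 1
]
```

For `example_edges`, the relayed path through `b` has reliability 0.9 × 0.8 = 0.72 and beats the direct transmission at 0.6, so a sparse structure that keeps only the two relay edges would still satisfy a threshold of 0.7 for this pair.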

    On the design of a cost-efficient resource management framework for low latency applications

    The ability to offer low latency communications is one of the critical design requirements for the upcoming 5G era. The current practice for achieving low latency is to overprovision network resources (e.g., bandwidth and computing resources). However, this approach is not cost-efficient, and cannot be applied at large scale. To solve this, more cost-efficient resource management is required to dynamically and efficiently exploit network resources to guarantee low latencies. The advent of network virtualization provides novel opportunities in achieving cost-efficient low latency communications. It decouples network resources from physical machines through virtualization, and groups resources in the form of virtual machines (VMs). By doing so, network resources can be flexibly increased at any network location through VM auto-scaling to alleviate network delays due to lack of resources. At the same time, the operational cost can be largely reduced by shutting down low-utilized VMs (e.g., for energy saving). Also, network virtualization enables the emerging concept of mobile edge-computing, whereby VMs can be utilized to host low latency applications at the network edge to shorten communication latency. Despite these advantages provided by virtualization, a key challenge is the optimal resource management of different physical and virtual resources for low latency communications. This thesis addresses the challenge by deploying a novel cost-efficient resource management framework that aims to solve the cost-efficient design of 1) low latency communication infrastructures; 2) dynamic resource management for low latency applications; and 3) fault-tolerant resource management.
Compared to the current practices, the proposed framework achieves an 80% reduction in deployment cost for the design of low latency communication infrastructures; continuously saves up to 33% of operational cost through dynamic resource management while always achieving low latencies; and succeeds in providing fault tolerance to low latency communications with a guaranteed operational cost

    Towards the development of a reliable reconfigurable real-time operating system on FPGAs

    In the last two decades, Field Programmable Gate Arrays (FPGAs) have been rapidly developed from simple “glue-logic” to a powerful platform capable of implementing a System on Chip (SoC). Modern FPGAs achieve not only high performance compared with General Purpose Processors (GPPs), thanks to hardware parallelism and dedication, but also better programming flexibility, in comparison to Application Specific Integrated Circuits (ASICs). Moreover, the hardware programming flexibility of FPGAs is further harnessed for both performance and manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR allows a part or parts of a circuit to be reconfigured at run-time, without interrupting the rest of the chip’s operation. As a result, hardware resources can be more efficiently exploited since the chip resources can be reused by swapping in or out hardware tasks to or from the chip in a time-multiplexed fashion. In addition, DPR improves fault tolerance against transient errors and permanent damage; for example, Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid error accumulation. Furthermore, power and heat can be reduced by removing finished or idle tasks from the chip. For all these reasons, DPR has significantly promoted Reconfigurable Computing (RC) and has become a very hot topic. However, since hardware integration is increasing at an exponential rate, and applications are becoming more complex with the growth of user demands, high-level application design and low-level hardware implementation are increasingly separated and layered. As a consequence, users can obtain little advantage from DPR without the support of system-level middleware.
To bridge the gap between the high-level application and the low-level hardware implementation, this thesis presents important contributions towards a Reliable, Reconfigurable and Real-Time Operating System (R3TOS), which facilitates the user exploitation of DPR from the application level, by managing the complex hardware in the background. In R3TOS, hardware tasks behave just like software tasks, which can be created, scheduled, and mapped to different computing resources on the fly. The novel contributions of this work are: 1) a novel implementation of an efficient task scheduler and allocator; 2) implementation of a novel real-time scheduling algorithm (FAEDF) and two efficacious allocating algorithms (EAC and EVC), which schedule tasks in real-time and circumvent emerging faults while maintaining more compact empty areas; 3) design and implementation of a fault-tolerant microprocessor by harnessing the existing FPGA resources, such as Error Correction Code (ECC) and configuration primitives; 4) a novel symmetric multiprocessing (SMP)-based architecture that supports a shared memory programming interface; 5) two demonstrations of the integrated system, including a) the K-Nearest Neighbour classifier, which is a non-parametric classification algorithm widely used in various fields of data mining; and b) pairwise sequence alignment, namely the Smith Waterman algorithm, used for identifying similarities between two biological sequences. R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking applications, whereby resources can be dynamically managed in respect of user requirements and hardware availability. Benefiting from this, not only can the hardware resources be used more efficiently, but the system performance can also be significantly increased. Results show that the scheduling and allocating efficiencies have been improved up to 2x, and the overall system performance is further improved by ~2.5x.
Future work includes the development of Network on Chip (NoC), which is expected to further increase the communication throughput; as well as the standardization and automation of our system design, which will be carried out in line with the enablement of other high-level synthesis tools, to allow application developers to benefit from the system in a more efficient manner

    Analysis of minimal path routing schemes in the presence of faults

    The design and analysis of fault-tolerant message routing schemes for large parallel systems has been the focus of much recent research. In this paper, we present a framework for the analysis of routing schemes in distributed memory multiprocessor systems containing faulty or unusable components. We introduce techniques for the derivation of the probabilities of successfully routing a single message using minimal path routing schemes. Using this framework, we derive closed form solutions for a wide range of routing schemes on the hypercube and on the two-dimensional mesh. The results obtained show the surprising resilience of the hypercube to a potentially large number of faults while demonstrating the inability of the mesh to tolerate a comparatively smaller number of faults. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/29944/1/0000302.pd
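As a small illustration of what a "probability of successfully routing a single message" can mean, the sketch below computes it by brute force for a tiny hypercube. It is not the paper's closed-form derivation, and the fault model is an assumption: exactly `f` faulty nodes chosen uniformly at random among all nodes other than the source and destination, with the message delivered whenever at least one fault-free minimal path exists. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def minimal_path_exists(src, dst, faulty, n):
    """True if some shortest src -> dst path in the n-cube avoids all faulty nodes.

    Nodes are integers 0 .. 2**n - 1. A minimal path only flips bits in
    which the current node still differs from dst.
    """
    if src in faulty or dst in faulty:
        return False
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        diff = node ^ dst
        for bit in range(n):
            if (diff >> bit) & 1:
                nxt = node ^ (1 << bit)  # one hop closer to dst
                if nxt not in faulty and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return False


def success_probability(n, src, dst, f):
    """Exact probability that a fault-free minimal path exists, by enumeration."""
    others = [v for v in range(1 << n) if v not in (src, dst)]
    total = good = 0
    for faulty in combinations(others, f):
        total += 1
        good += minimal_path_exists(src, dst, set(faulty), n)
    return Fraction(good, total)
```

For antipodal nodes in a 3-cube, `success_probability(3, 0, 7, 3)` evaluates to 9/10 under this model: of the 20 ways to place three faults, only the two that disable all neighbours of the source or all neighbours of the destination cut every minimal path. Enumeration grows combinatorially, which is why closed-form results of the kind the abstract describes matter for larger systems.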

    Fault-Tolerant Topology Generation Method for Application-Specific Network-on-Chips

    As the technology sizes of integrated circuits (ICs) scale down rapidly, current transistor densities on chips dramatically increase. While nanometer feature sizes allow denser chip designs in each technology generation, fabricated ICs become more susceptible to wear-outs, causing operation failure. Even a single link failure within an on-chip fabric can halt communication between application blocks, which makes the entire chip useless. In this paper, we aim to make faulty chips designed with network-on-chip (NoC) communication usable. Specifically, we present a fault-tolerant irregular topology-generation method for application-specific NoC designs. The designed NoC topology allows a different routing path if there is a link failure on the default routing path. Additionally, we present a simulated annealing-based application mapping algorithm aiming to minimize total energy consumption of the NoC design. We compare fault-tolerant topologies with non-fault-tolerant application-specific irregular topologies on energy consumption, performance, and area using multimedia benchmarks and custom-generated graphs. Our results demonstrate that our method is able to determine fault-tolerant topologies with negligible area increase and better energy values. © 1982-2012 IEEE

    Algorithms in fault-tolerant CLOS networks


    Approaches to incorporating robustness into airline scheduling

    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000. Includes bibliographical references (p. 93-94). The airline scheduling process used by major airlines today aims to develop optimal schedules which maximize revenue. However, these schedules are often far from "optimal" once deployed in the real world because they do not accurately take into account possible weather, air traffic control (ATC), and other disruptions that can occur during operation. The resulting flight delays and cancellations can cause significant revenue loss, not to mention service disruptions and customer dissatisfaction. A novel approach to addressing this problem is to design schedules that are robust to schedule disruptions and can be degraded at any airport location or in any region with minimal impact on the entire schedule. This research project suggests new methods for creating more robust airline schedules which can be easily recovered in the face of irregular operations. We show how to create multiple optimal solutions to the Aircraft Routing problem and suggest how to evaluate robustness of those solutions. Other potential methods for increasing robustness of airline schedules are reviewed. by Yana Ageeva. M.Eng