26 research outputs found

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Get PDF
    Distributed-memory systems are a key to achieve high performance computing and the most favorable architectures used in advanced research problems. Mesh connected multicomputer are one of the most popular architectures that have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique has been widely used in design of distributed-memory systems in which the packet is divided into small flits. Also, the multicast communication has been widely used in distributed-memory systems which is one source node sends the same message to several destination nodes. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Development of fault tolerant multicast routing algorithms in 2D mesh networks is an important issue. This dissertation presents, new fault tolerant multicast routing algorithms for distributed-memory systems performance using wormhole routed 2D mesh. These algorithms are described for fault tolerant routing in 2D mesh networks, but it can also be extended to other topologies. These algorithms are a combination of a unicast-based multicast algorithm and tree-based multicast algorithms. These algorithms works effectively for the most commonly encountered faults in mesh networks, f-rings, f-chains and concave fault regions. It is shown that the proposed routing algorithms are effective even in the presence of a large number of fault regions and large size of fault region. These algorithms are proved to be deadlock-free. Also, the problem of fault regions overlap is solved. Four essential performance metrics in mesh networks will be considered and calculated; also these algorithms are a limited-global-information-based multicasting which is a compromise of local-information-based approach and global-information-based approach. Data mining is used to validate the results and to enlarge the sample. The proposed new multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms

    Exploring Adaptive Implementation of On-Chip Networks

    Get PDF
    As technology geometries have shrunk to the deep submicron regime, the communication delay and power consumption of global interconnections in high performance Multi- Processor Systems-on-Chip (MPSoCs) are becoming a major bottleneck. The Network-on- Chip (NoC) architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication issues such as performance limitations of long interconnects and integration of large number of Processing Elements (PEs) on a chip. The choice of routing protocol and NoC structure can have a significant impact on performance and power consumption in on-chip networks. In addition, building a high performance, area and energy efficient on-chip network for multicore architectures requires a novel on-chip router allowing a larger network to be integrated on a single die with reduced power consumption. On top of that, network interfaces are employed to decouple computation resources from communication resources, to provide the synchronization between them, and to achieve backward compatibility with existing IP cores. Three adaptive routing algorithms are presented as a part of this thesis. The first presented routing protocol is a congestion-aware adaptive routing algorithm for 2D mesh NoCs which does not support multicast (one-to-many) traffic while the other two protocols are adaptive routing models supporting both unicast (one-to-one) and multicast traffic. A streamlined on-chip router architecture is also presented for avoiding congested areas in 2D mesh NoCs via employing efficient input and output selection. The output selection utilizes an adaptive routing algorithm based on the congestion condition of neighboring routers while the input selection allows packets to be serviced from each input port according to its congestion level. Moreover, in order to increase memory parallelism and bring compatibility with existing IP cores in network-based multiprocessor architectures, adaptive network interface architectures are presented to use multiple SDRAMs which can be accessed simultaneously. In addition, a smart memory controller is integrated in the adaptive network interface to improve the memory utilization and reduce both memory and network latencies. Three Dimensional Integrated Circuits (3D ICs) have been emerging as a viable candidate to achieve better performance and package density as compared to traditional 2D ICs. In addition, combining the benefits of 3D IC and NoC schemes provides a significant performance gain for 3D architectures. In recent years, inter-layer communication across multiple stacked layers (vertical channel) has attracted a lot of interest. In this thesis, a novel adaptive pipeline bus structure is proposed for inter-layer communication to improve the performance by reducing the delay and complexity of traditional bus arbitration. In addition, two mesh-based topologies for 3D architectures are also introduced to mitigate the inter-layer footprint and power dissipation on each layer with a small performance penalty.Siirretty Doriast

    Adaptive Hybrid Switching Technique for Parallel Computing System

    Get PDF
    Parallel processing accelerates computations by solving a single problem using multiple compute nodes interconnected by a network. The scalability of a parallel system is limited byits ability to communicate and coordinate processing. Circuit switching, packet switchingand wormhole routing are dominant switching techniques. Our simulation results show that wormhole routing and circuit switching each excel under different types of traffic.This dissertation presents a hybrid switching technique that combines wormhole routing with circuit switching in a single switch using vrtual channels and time division multiplexing. The performance of this hybrid switch is significantly impacted by the effciency of traffic scheduling and thus, this dissertation also explores the design and scalability of hardware scheduling for the hybrid switch. In particular, we introduce two schedulers for crossbar networks: a greedy scheduler and an optimal scheduler that improves upon the resultsprovided by the greedy scheduler. For the time division multiplexing portion of the hybrid switch, this dissertation presents three allocation methods that combine wormhole switching with predictive circuit switching. We further extend this research from crossbar networks to fat tree interconnected networks with virtual channels. The global "level-wise" scheduling algorithm is presented and improves network utilization by 30% when compared to a switch-level algorithm. The performance of the hybrid switching is evaluated on a cycle accurate simulation framework that is also part of this dissertation research. Our experimental results demonstrate that the hybrid switch is capable of transferring both predictable traffics and unpredictable traffics successfully. By dynamically selecting the proper switching technique based on the type of communication traffic, the hybrid switch improves communication for most types of traffic

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Routing and Wavelength Assignment for Multicast Communication in Optical Network-on-Chip

    Get PDF
    An Optical Network-on-Chip (ONoC) is an emerging chip-level optical interconnection technology to realise high-performance and power-efficient inter-core communication for many-core processors. Within the field, multicast communication is one of the most important inter-core communication forms. It is not only widely used in parallel computing applications in Chip Multi-Processors (CMPs), but also common in emerging areas such as neuromorphic computing. While many studies have been conducted on designing ONoC architectures and routing schemes to support multicast communication, most existing solutions adopt the methods that were initially proposed for electrical interconnects. These solutions can neither fully take advantage of optical communication nor address the special requirements of an ONoC. Moreover, most of them focus only on the optimisation of one multicast, which limits the practical applications because real systems often have to handle multiple multicasts requested from various applications. Hence, this thesis will address the design of a high-performance communication scheme for multiple multicasts by taking into account the unique characteristics and constraints of an ONoC. This thesis studies the problem from a network-level perspective. The design methodology is to optimally route all multicasts requested simultaneously from the applications in an ONoC, with the objective of efficiently utilising available wavelengths. The novelty is to adopt multicast-splitting strategies, where a multicast can be split into several sub-multicasts according to the distribution of multicast nodes, in order to reduce the conflicts of different multicasts. As routing and wavelength assignment problem is an NP-hard problem, heuristic approaches that use the multicast-splitting strategy are proposed in this thesis. Specifically, three routing and wavelength assignment schemes for multiple multicasts in an ONoC are proposed for different problem domains. Firstly, PRWAMM, a Path-based Routing and Wavelength Assignment for Multiple Multicasts in an ONoC, is proposed. Due to the low manufacture complexity requirement of an ONoC, e.g., no splitters, path-based routing is studied in PRWAMM. Two wavelength-assignment strategies for multiple multicasts under path-based routing are proposed. One is an intramulticast wavelength assignment, which assigns wavelength(s) for one multicast. The other is an inter-multicast wavelength assignment, which assigns wavelength(s) for different multicasts, according to the distributions of multicasts. Simulation results show that PRWAMM can reduce the average number of wavelengths by 15% compared to other path-based schemes. Secondly, RWADMM, a Routing and Wavelength Assignment scheme for Distribution-based Multiple Multicasts in a 2D ONoC, is proposed. Because path-based routing lacks flexibility, it cannot reduce the link conflicts effectively. Hence, RWADMM is designed, based on the distribution of different multicasts, which includes two algorithms. One is an optimal routing and wavelength assignment algorithm for special distributions of multicast nodes. The other is a heuristic routing and wavelength assignment algorithm for random distributions of multicast nodes. Simulation results show that RWADMM can reduce the number of wavelengths by 21.85% on average, compared to the state-of-the-art solutions in a 2D ONoC. Thirdly, CRRWAMM, a Cluster-based Routing and Reusable Wavelength Assignment scheme for Multiple Multicasts in a 3D ONoC, is proposed. Because of the different architectures with a 2D ONoC (e.g., the layout of nodes, optical routers), the methods designed for a 2D ONoC cannot be simply extended to a 3D ONoC. In CRRWAMM, the distribution of multicast nodes in a mesh-based 3D ONoC is analysed first. Then, routing theorems for special instances are derived. Based on the theorems, a general routing scheme, which includes a cluster-based routing method and a reusable wavelength assignment method, is proposed. Simulation results show that CRRWAMM can reduce the number of wavelengths by 33.2% on average, compared to other schemes in a 3D ONoC. Overall, the three routing and wavelength assignment schemes can achieve high-performance multicast communication for multiple multicasts of their problem domains in an ONoC. They all have the advantages of a low routing complexity, a low wavelength requirement, and good scalability, compared to their counterparts, respectively. These methods make an ONoC a flexible high-performance computing platform to execute various parallel applications with different multicast requirements. As future work, I will investigate the power consumption of various routing schemes for multicasts. Using a multicast-splitting strategy may increase power consumption since it needs different wavelengths to send packets to different destinations for one multicast, though the reduction of wavelengths used in the schemes can also potentially decrease overall power consumption. Therefore, how to achieve the best trade-off between the total number of wavelengths used and the number of sub-multicasts in order to reduce power consumption will be interesting future research

    Network-on-Chip

    Get PDF
    Addresses the Challenges Associated with System-on-Chip Integration Network-on-Chip: The Next Generation of System-on-Chip Integration examines the current issues restricting chip-on-chip communication efficiency, and explores Network-on-chip (NoC), a promising alternative that equips designers with the capability to produce a scalable, reusable, and high-performance communication backbone by allowing for the integration of a large number of cores on a single system-on-chip (SoC). This book provides a basic overview of topics associated with NoC-based design: communication infrastructure design, communication methodology, evaluation framework, and mapping of applications onto NoC. It details the design and evaluation of different proposed NoC structures, low-power techniques, signal integrity and reliability issues, application mapping, testing, and future trends. Utilizing examples of chips that have been implemented in industry and academia, this text presents the full architectural design of components verified through implementation in industrial CAD tools. It describes NoC research and developments, incorporates theoretical proofs strengthening the analysis procedures, and includes algorithms used in NoC design and synthesis. In addition, it considers other upcoming NoC issues, such as low-power NoC design, signal integrity issues, NoC testing, reconfiguration, synthesis, and 3-D NoC design. This text comprises 12 chapters and covers: The evolution of NoC from SoC—its research and developmental challenges NoC protocols, elaborating flow control, available network topologies, routing mechanisms, fault tolerance, quality-of-service support, and the design of network interfaces The router design strategies followed in NoCs The evaluation mechanism of NoC architectures The application mapping strategies followed in NoCs Low-power design techniques specifically followed in NoCs The signal integrity and reliability issues of NoC The details of NoC testing strategies reported so far The problem of synthesizing application-specific NoCs Reconfigurable NoC design issues Direction of future research and development in the field of NoC Network-on-Chip: The Next Generation of System-on-Chip Integration covers the basic topics, technology, and future trends relevant to NoC-based design, and can be used by engineers, students, and researchers and other industry professionals interested in computer architecture, embedded systems, and parallel/distributed systems

    Neural networks-on-chip for hybrid bio-electronic systems

    Get PDF
    PhD ThesisBy modelling the brains computation we can further our understanding of its function and develop novel treatments for neurological disorders. The brain is incredibly powerful and energy e cient, but its computation does not t well with the traditional computer architecture developed over the previous 70 years. Therefore, there is growing research focus in developing alternative computing technologies to enhance our neural modelling capability, with the expectation that the technology in itself will also bene t from increased awareness of neural computational paradigms. This thesis focuses upon developing a methodology to study the design of neural computing systems, with an emphasis on studying systems suitable for biomedical experiments. The methodology allows for the design to be optimized according to the application. For example, di erent case studies highlight how to reduce energy consumption, reduce silicon area, or to increase network throughput. High performance processing cores are presented for both Hodgkin-Huxley and Izhikevich neurons incorporating novel design features. Further, a complete energy/area model for a neural-network-on-chip is derived, which is used in two exemplar case-studies: a cortical neural circuit to benchmark typical system performance, illustrating how a 65,000 neuron network could be processed in real-time within a 100mW power budget; and a scalable highperformance processing platform for a cerebellar neural prosthesis. From these case-studies, the contribution of network granularity towards optimal neural-network-on-chip performance is explored
    corecore