38 research outputs found

    New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

    Get PDF
    Distributed-memory systems are a key to achieve high performance computing and the most favorable architectures used in advanced research problems. Mesh connected multicomputer are one of the most popular architectures that have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique has been widely used in design of distributed-memory systems in which the packet is divided into small flits. Also, the multicast communication has been widely used in distributed-memory systems which is one source node sends the same message to several destination nodes. Fault tolerance refers to the ability of the system to operate correctly in the presence of faults. Development of fault tolerant multicast routing algorithms in 2D mesh networks is an important issue. This dissertation presents, new fault tolerant multicast routing algorithms for distributed-memory systems performance using wormhole routed 2D mesh. These algorithms are described for fault tolerant routing in 2D mesh networks, but it can also be extended to other topologies. These algorithms are a combination of a unicast-based multicast algorithm and tree-based multicast algorithms. These algorithms works effectively for the most commonly encountered faults in mesh networks, f-rings, f-chains and concave fault regions. It is shown that the proposed routing algorithms are effective even in the presence of a large number of fault regions and large size of fault region. These algorithms are proved to be deadlock-free. Also, the problem of fault regions overlap is solved. Four essential performance metrics in mesh networks will be considered and calculated; also these algorithms are a limited-global-information-based multicasting which is a compromise of local-information-based approach and global-information-based approach. Data mining is used to validate the results and to enlarge the sample. The proposed new multicast routing techniques are used to enhance the performance of distributed-memory systems. Simulation results are presented to demonstrate the efficiency of the proposed algorithms

    Exploring Adaptive Implementation of On-Chip Networks

    Get PDF
    As technology geometries have shrunk to the deep submicron regime, the communication delay and power consumption of global interconnections in high performance Multi- Processor Systems-on-Chip (MPSoCs) are becoming a major bottleneck. The Network-on- Chip (NoC) architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication issues such as performance limitations of long interconnects and integration of large number of Processing Elements (PEs) on a chip. The choice of routing protocol and NoC structure can have a significant impact on performance and power consumption in on-chip networks. In addition, building a high performance, area and energy efficient on-chip network for multicore architectures requires a novel on-chip router allowing a larger network to be integrated on a single die with reduced power consumption. On top of that, network interfaces are employed to decouple computation resources from communication resources, to provide the synchronization between them, and to achieve backward compatibility with existing IP cores. Three adaptive routing algorithms are presented as a part of this thesis. The first presented routing protocol is a congestion-aware adaptive routing algorithm for 2D mesh NoCs which does not support multicast (one-to-many) traffic while the other two protocols are adaptive routing models supporting both unicast (one-to-one) and multicast traffic. A streamlined on-chip router architecture is also presented for avoiding congested areas in 2D mesh NoCs via employing efficient input and output selection. The output selection utilizes an adaptive routing algorithm based on the congestion condition of neighboring routers while the input selection allows packets to be serviced from each input port according to its congestion level. Moreover, in order to increase memory parallelism and bring compatibility with existing IP cores in network-based multiprocessor architectures, adaptive network interface architectures are presented to use multiple SDRAMs which can be accessed simultaneously. In addition, a smart memory controller is integrated in the adaptive network interface to improve the memory utilization and reduce both memory and network latencies. Three Dimensional Integrated Circuits (3D ICs) have been emerging as a viable candidate to achieve better performance and package density as compared to traditional 2D ICs. In addition, combining the benefits of 3D IC and NoC schemes provides a significant performance gain for 3D architectures. In recent years, inter-layer communication across multiple stacked layers (vertical channel) has attracted a lot of interest. In this thesis, a novel adaptive pipeline bus structure is proposed for inter-layer communication to improve the performance by reducing the delay and complexity of traditional bus arbitration. In addition, two mesh-based topologies for 3D architectures are also introduced to mitigate the inter-layer footprint and power dissipation on each layer with a small performance penalty.Siirretty Doriast

    Interconnects architectures for many-core era using surface-wave communication

    Get PDF
    PhD ThesisNetworks-on-chip (NoCs) is a communication paradigm that has emerged aiming to address on-chip communication challenges and to satisfy interconnection demands for chip-multiprocessors (CMPs). Nonetheless, there is continuous demand for even higher computational power, which is leading to a relentless downscaling of CMOS technology to enable the integration of many-cores. However, technology downscaling is in favour of the gate nodes over wires in terms of latency and power consumption. Consequently, this has led to the era of many-core processors where power consumption and performance are governed by inter-core communications rather than core computation. Therefore, NoCs need to evolve from being merely metalbased implementations which threaten to be a performance and power bottleneck for many-core efficiency and scalability. To overcome such intensified inter-core communication challenges, this thesis proposes a novel interconnect technology: the surface-wave interconnect (SWI). This new RF-based on-chip interconnect has notable characteristics compared to cutting-edge on-chip interconnects in terms of CMOS compatibility, high speed signal propagation, low power dissipation, and massive signal fan-out. Nonetheless, the realization of the SWI requires investigations at different levels of abstraction, such as the device integration and RF engineering levels. The aim of this thesis is to address the networking and system level challenges and highlight the potential of this interconnect. This should encourage further research at other levels of abstraction. Two specific system-level challenges crucial in future many-core systems are tackled in this study, which are cross-the-chip global communication and one-to-many communication. This thesis makes four major contributions towards this aim. The first is reducing the NoC average-hop count, which would otherwise increase packet-latency exponentially, by proposing a novel hybrid interconnect architecture. This hybrid architecture can not only utilize both regular metal-wire and SWI, but also exploits merits of both bus and NoC architectures in terms of connectivity compared to other general-purpose on-chip interconnect architectures. The second contribution addresses global communication issues by developing a distance-based weighted-round-robin arbitration (DWA) algorithm. This technique prioritizes global communication to be send via SWI short-cuts, which offer more efficient power dissipation and faster across-the-chip signal propagation. Results obtained using a cycleaccurate simulator demonstrate the effectiveness of the proposed system architecture in terms of significant power reduction, considervii able average delay reduction and higher throughput compared to a regular NoC. The third contribution is in handling multicast communications, which are normally associated with traffic overload, hotspots and deadlocks and therefore increase, by an order of magnitude the power consumption and latency. This has been achieved by proposing a novel routing and centralized arbitration schemes that exploits the SWI0s remarkable fan-out features. The evaluation demonstrates drastic improvements in the effectiveness of the proposed architecture in terms of power consumption ( 2-10x) and performance ( 22x) but with negligible hardware overheads ( 2%). The fourth contribution is to further explore multicast contention handling in a flexible decentralized manner, where original techniques such as stretch-multicast and ID-tagging flow control have been developed. A comparison of these techniques shows that the decentralized approach is superior to the centralized approach with low traffic loads, while the latter outperforms the former near and after NoC saturation

    Adaptive and Deadlock-Free Tree-Based Multicast Routing for Networks-on-Chip

    Get PDF
    This paper presents the first synthesizable network-on-chip (NoC) based on a mesh topology, which supports adaptive and deadlock-free tree-based multicast routing without virtual channels. The deadlock-free routing algorithms for unicast and multicast packets are the same. Therefore, the routing function\ud gate-level implementation is very efficient. Multicast packets\ud are injected to the network by sending multiple packet headers beforehand. The packet headers contain destination addresses to set up multicast trees connecting a source with multiple destination nodes. An additional locally uniform identification (ID) field is packetized together with flits belonging to the same packet. Therefore, flits of different unicast or multicast packets can be interleaved in the same queue because of the local ID-tags, which are updated and mapped dynamically to support bandwidth scalability of interconnection links. Deadlocks in tree-based multicast\ud routing are handled using a flit-by-flit round arbitration and a\ud fair hold???release tagging mechanism. The effectiveness of the novel mechanism has been experimented under multiple multicast\ud conflicts scenarios, where the experimental results show that all traffic is accepted in-order and lossless in their destination nodes even if adaptive routing functions are used and the sizes of the\ud multicast messages are very long

    Tinsel: a manythread overlay for FPGA clusters

    Get PDF
    Commodity FPGA boards with advanced networking facilities have great potential in the construction of high-performance compute clusters that scale. However, low-level design tools and long synthesis times are major barriers to productivity for application developers. In this paper, we explore the potential of a distributed soft-processor overlay, programmed in software at a high-level of abstraction, to deliver a useful level of performance for FPGA clusters. In particular, we demonstrate the use of hardware multhreading to achieve a fast, space-efficient, high-throughput overlay, and compare a 12-FPGA instance of it (12,288 RISC-V threads) against a conventional Xeon cluster on the problem of distributed graph processing.This work was supported by EPSRC grant EP/N031768/1 (POETS project)

    Network on chip architecture for multi-agent systems in FPGA

    Get PDF
    A system of interacting agents is, by definition, very demanding in terms of computational resources. Although multi-agent systems have been used to solve complex problems in many areas, it is usually very difficult to perform large-scale simulations in their targeted serial computing platforms. Reconfigurable hardware, in particular Field Programmable Gate Arrays (FPGA) devices, have been successfully used in High Performance Computing applications due to their inherent flexibility, data parallelism and algorithm acceleration capabilities. Indeed, reconfigurable hardware seems to be the next logical step in the agency paradigm, but only a few attempts have been successful in implementing multi-agent systems in these platforms. This paper discusses the problem of inter-agent communications in Field Programmable Gate Arrays. It proposes a Network-on-Chip in a hierarchical star topology to enable agents’ transactions through message broadcasting using the Open Core Protocol, as an interface between hardware modules. A customizable router microarchitecture is described and a multi-agent system is created to simulate and analyse message exchanges in a generic heavy traffic load agent-based application. Experiments have shown a throughput of 1.6Gbps per port at 100 MHz without packet loss and seamless scalability characteristics

    Exploration architecturale et étude des performances des réseaux sur puce 3D partiellement connectés verticalement

    Get PDF
    Utilization of the third dimension can lead to a significant reduction in power and average hop-count in Networks- on-Chip (NoC). TSV technology, as the most promising technology in 3D integration, offers short and fast vertical links which copes with the long wire problem in 2D NoCs. Nonetheless, TSVs are huge and their manufacturing process is still immature, which reduces the yield of 3D NoC based SoC. Therefore, Vertically-Partially-Connected 3D-NoC has been introduced to benefit from both 3D technology and high yield. Moreover, Vertically-Partially-Connected 3D-NoC is flexible, due to the fact that the number, placement, and assignment of the vertical links in each layer can be decided based on the limitations and requirements of the design. However, there are challenges to present a feasible and high-performance Vertically-Partially-Connected Mesh-based 3D-NoC due to the removed vertical links between the layers. This thesis addresses the challenges of Vertically-Partially-Connected Mesh-based 3D-NoC: Routing is the major problem of the Vertically-Partially-Connected 3D-NoC. Since some vertical links are removed, some of the routers do not have up or/and down ports. Therefore, there should be a path to send a packet to upper or lower layer which obviously has to be determined by a routing algorithm. The suggested paths should not cause deadlock through the network. To cope with this problem we explain and evaluate a deadlock- and livelock-free routing algorithm called Elevator First. Fundamentally, the NoC performance is affected by both 1) micro-architecture of routers and 2) architecture of interconnection. The router architecture has a significant effect on the performance of NoC, as it is a part of transportation delay. Therefore, the simplicity and efficiency of the design of NoC router micro architecture are the critical issues, especially in Vertically-Partially-Connected 3D-NoC which has already suffered from high average latency due to some removed vertical links. Therefore, we present the design and implementation the micro-architecture of a router which not only exactly and quickly transfers the packets based on the Elevator First routing algorithm, but it also consumes a reasonable amount of area and power. From the architecture point of view, the number and placement of vertical links have a key role in the performance of the Vertically-Partially-Connected Mesh-based 3D-NoC, since they affect the average hop-count and link and buffer utilization in the network. Furthermore, the assignment of the vertical links to the routers which do not have up or/and down port(s) is an important issue which influences the performance of the 3D routers. Therefore, the architectural exploration of Vertically-Partially-Connected Mesh-based 3D-NoC is both important and non-trivial. We define, study, and evaluate the parameters which describe the behavior of the network. The parameters can be helpful to place and assign the vertical links in the layers effectively. Finally, we propose a quadratic-based estimation method to anticipate the saturation threshold of the network's average latency.L'utilisation de la troisième dimension peut entraîner une réduction significative de la puissance et de la latence moyenne du trafic dans les réseaux sur puce (Network-on-Chip). La technologie des vias à travers le substrat (ou Through-Silicon Via) est la technologie la plus prometteuse pour l'intégration 3D, car elle offre des liens verticaux courts qui remédient au problème des longs fils dans les NoCs-2D. Les TSVs sont cependant énormes et les processus de fabrication sont immatures, ce qui réduit le rendement des systèmes sur puce à base de NoC-3D. Par conséquent, l'idée de réseaux sur puce 3D partiellement connectés verticalement a été introduite pour bénéficier de la technologie 3D tout en conservant un haut rendement. En outre, de tels réseaux sont flexibles, car le nombre, l'emplacement et l'affectation des liens verticaux dans chaque couche peuvent être décidés en fonction des exigences de l'application. Cependant, ce type de réseaux pose un certain nombre de défis : Le routage est le problème majeur, car l'élimination de certains liens verticaux fait que l'on ne peut utiliser les algorithmes classiques qui suivent l'ordre des dimensions. Pour répondre à cette question nous expliquons et évaluons un algorithme de routage déterministe appelé “Elevator First”, qui garanti d'une part que si un chemin existe, alors on le trouve, et que d'autre part il n'y aura pas d'interblocages. Fondamentalement, la performance du NoC est affecté par a) la micro architecture des routeurs et b) l'architecture d'interconnexion. L'architecture du routeur a un effet significatif sur la performance du NoC, à cause de la latence qu'il induit. Nous présentons la conception et la mise en œuvre de la micro-architecture d'un routeur à faible latence implantant​​l'algorithme de routage Elevator First, qui consomme une quantité raisonnable de surface et de puissance. Du point de vue de l'architecture, le nombre et le placement des liens verticaux ont un rôle important dans la performance des réseaux 3D partiellement connectés verticalement, car ils affectent le nombre moyen de sauts et le taux d'utilisation des FIFOs dans le réseau. En outre, l'affectation des liens verticaux vers les routeurs qui n'ont pas de ports vers le haut ou/et le bas est une question importante qui influe fortement sur les performances. Par conséquent, l'exploration architecturale des réseaux sur puce 3D partiellement connectés verticalement est importante. Nous définissons, étudions et évaluons des paramètres qui décrivent le comportement du réseau, de manière à déterminer le placement et l'affectation des liens verticaux dans les couches de manière simple et efficace. Nous proposons une méthode d'estimation quadratique visantà anticiper le seuil de saturation basée sur ces paramètres

    Cost Effective Routing Implementations for On-chip Networks

    Full text link
    Arquitecturas de múltiples núcleos como multiprocesadores (CMP) y soluciones multiprocesador para sistemas dentro del chip (MPSoCs) actuales se basan en la eficacia de las redes dentro del chip (NoC) para la comunicación entre los diversos núcleos. Un diseño eficiente de red dentro del chip debe ser escalable y al mismo tiempo obtener valores ajustados de área, latencia y consumo de energía. Para diseños de red dentro del chip de propósito general se suele usar topologías de malla 2D ya que se ajustan a la distribución del chip. Sin embargo, la aparición de nuevos retos debe ser abordada por los diseñadores. Una mayor probabilidad de defectos de fabricación, la necesidad de un uso optimizado de los recursos para aumentar el paralelismo a nivel de aplicación o la necesidad de técnicas eficaces de ahorro de energía, puede ocasionar patrones de irregularidad en las topologías. Además, el soporte para comunicación colectiva es una característica buscada para abordar con eficacia las necesidades de comunicación de los protocolos de coherencia de caché. En estas condiciones, un encaminamiento eficiente de los mensajes se convierte en un reto a superar. El objetivo de esta tesis es establecer las bases de una nueva arquitectura para encaminamiento distribuido basado en lógica que es capaz de adaptarse a cualquier topología irregular derivada de una estructura de malla 2D, proporcionando así una cobertura total para cualquier caso resultado de soportar los retos mencionados anteriormente. Para conseguirlo, en primer lugar, se parte desde una base, para luego analizar una evolución de varios mecanismos, y finalmente llegar a una implementación, que abarca varios módulos para alcanzar el objetivo mencionado anteriormente. De hecho, esta última implementación tiene por nombre eLBDR (effective Logic-Based Distributed Routing). Este trabajo cubre desde el primer mecanismo, LBDR, hasta el resto de mecanismos que han surgido progresivamente.Rodrigo Mocholí, S. (2010). Cost Effective Routing Implementations for On-chip Networks [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8962Palanci

    Fly-Over: A Light-Weight Distributed Router Power-Gating Mechanism for Energy-Efficient Interconnects

    Get PDF
    Scalable Networks-on-chip (NoCs) have become the de facto interconnection mechanism in large scale Chip Multiprocessors. Not only are NoCs devouring a large fraction of the on-chip power budget but static NoC power consumption is becoming the dominant component as technology scales down. Hence reducing static NoC power consumption is critical for energy-efficient computing. Previous research has proposed to power-gate routers attached to inactive cores so as to save static power, but they either required centralized decision making and global network knowledge or a non-scalable escape ring network. In this paper, we propose Fly-Over (FLOV), a light-weight distributed mechanism for power gating routers, which encompasses FLOV router microarchitecture and a partition-based dynamic routing algorithm to maintain network functionality. With simple modifications to the baseline router microarchitecture, FLOV can facilitate fly-over links over power-gated routers. The proposed routing algorithm provides best-effort minimal path routing without the necessity for global network information. We evaluate our scheme using both unicast and multicast synthetic workloads as well as real workloads from PARSEC 2.1 benchmark suite. The results show that FLOV can achieve 19.2% latency reduction and 16.9% total power savings

    Quarc: an architecture for efficient on-chip communication

    Get PDF
    The exponential downscaling of the feature size has enforced a paradigm shift from computation-based design to communication-based design in system on chip development. Buses, the traditional communication architecture in systems on chip, are incapable of addressing the increasing bandwidth requirements of future large systems. Networks on chip have emerged as an interconnection architecture offering unique solutions to the technological and design issues related to communication in future systems on chip. The transition from buses as a shared medium to networks on chip as a segmented medium has given rise to new challenges in system on chip realm. By leveraging the shared nature of the communication medium, buses have been highly efficient in delivering multicast communication. The segmented nature of networks, however, inhibits the multicast messages to be delivered as efficiently by networks on chip. Relying on extensive research on multicast communication in parallel computers, several network on chip architectures have offered mechanisms to perform the operation, while conforming to resource constraints of the network on chip paradigm. Multicast communication in majority of these networks on chip is implemented by establishing a connection between source and all multicast destinations before the message transmission commences. Establishing the connections incurs an overhead and, therefore, is not desirable; in particular in latency sensitive services such as cache coherence. To address high performance multicast communication, this research presents Quarc, a novel network on chip architecture. The Quarc architecture targets an area-efficient, low power, high performance implementation. The thesis covers a detailed representation of the building blocks of the architecture, including topology, router and network interface. The cost and performance comparison of the Quarc architecture against other network on chip architectures reveals that the Quarc architecture is a highly efficient architecture. Moreover, the thesis introduces novel performance models of complex traffic patterns, including multicast and quality of service-aware communication
    corecore