
    Tinsel: a manythread overlay for FPGA clusters

    Commodity FPGA boards with advanced networking facilities have great potential in the construction of high-performance compute clusters that scale. However, low-level design tools and long synthesis times are major barriers to productivity for application developers. In this paper, we explore the potential of a distributed soft-processor overlay, programmed in software at a high level of abstraction, to deliver a useful level of performance for FPGA clusters. In particular, we demonstrate the use of hardware multithreading to achieve a fast, space-efficient, high-throughput overlay, and compare a 12-FPGA instance of it (12,288 RISC-V threads) against a conventional Xeon cluster on the problem of distributed graph processing. This work was supported by EPSRC grant EP/N031768/1 (POETS project).
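    The abstract does not show the overlay's programming interface. As a purely illustrative sketch of how a hardware-multithreaded, message-passing graph kernel of this kind is typically expressed, the C fragment below uses hypothetical overlay_send/overlay_recv primitives stubbed with an in-memory mailbox; it is not Tinsel's API.

        /* Hedged sketch: one "thread" of a partitioned BFS-style graph kernel.
         * overlay_send/overlay_recv are stand-ins for whatever message-passing
         * primitives a manythread overlay exposes; they are NOT Tinsel's API. */
        #include <stdint.h>
        #include <stdio.h>

        #define LOCAL_VERTS 4

        typedef struct { uint32_t dest_vertex; uint32_t new_dist; } msg_t;

        /* Stand-in mailbox: a real overlay would route these over the FPGA network. */
        static msg_t inbox[64];
        static int inbox_head, inbox_tail;

        static void overlay_send(uint32_t dest_thread, msg_t m) {
            (void)dest_thread;              /* single-thread stub: loop back locally */
            inbox[inbox_tail++ % 64] = m;
        }
        static int overlay_recv(msg_t *m) {
            if (inbox_head == inbox_tail) return 0;
            *m = inbox[inbox_head++ % 64];
            return 1;
        }

        int main(void) {
            /* Each hardware thread owns a slice of vertices and their distances. */
            uint32_t dist[LOCAL_VERTS] = { 0, 99, 99, 99 };
            /* Toy local adjacency: vertex 0 -> 1, 1 -> 2, 2 -> 3. */
            int next[LOCAL_VERTS] = { 1, 2, 3, -1 };

            /* Seed: announce the root's distance to its neighbour. */
            overlay_send(0, (msg_t){ .dest_vertex = 1, .new_dist = 1 });

            msg_t m;
            while (overlay_recv(&m)) {                 /* relax incoming updates  */
                if (m.new_dist < dist[m.dest_vertex]) {
                    dist[m.dest_vertex] = m.new_dist;
                    int n = next[m.dest_vertex];
                    if (n >= 0)                        /* propagate to neighbour  */
                        overlay_send(0, (msg_t){ .dest_vertex = (uint32_t)n,
                                                 .new_dist = m.new_dist + 1 });
                }
            }
            for (int v = 0; v < LOCAL_VERTS; v++)
                printf("vertex %d: dist %u\n", v, dist[v]);
            return 0;
        }

    In the real system, thousands of such threads would each own a slice of the graph, and the mailbox would be the FPGA cluster's network rather than a local queue.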

    A Mathematical Model for Evaluating the Performance of Multicast Systems

    © 2008 IEEE. Reprinted, with permission, from Syed S. Rizvi, Aasia Riasat, and Khaled M. Elleith, "A Mathematical Model for Evaluating the Performance of Multicast Systems," The 1st IEEE International Workshop on IP Multimedia Communications (IPMC 2008), August 4-7, 2008, St. Thomas, U.S. Virgin Islands. The Internet is experiencing growing demand for high-speed real-time applications such as live streaming multimedia, videoconferencing, and multiparty games. IP multicast is an efficient transmission technique for supporting these applications. However, several architectural issues hinder the development and deployment of IP multicast, such as the lack of an efficient multicast address allocation scheme. On the other hand, End System Multicasting (ESM) is a promising application-layer scheme in which all multicast functionality is shifted to the end users. Supporting high-speed real-time applications demands a sound understanding of these schemes and the factors that affect end-user requirements. In this paper we propose analytical and mathematical models for characterizing the performance of IP multicast and ESM. Our proposed mathematical model can be used to design and implement a more efficient and robust ESM model for future networks.
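    The abstract does not reproduce the model itself. As a generic illustration of the kind of per-packet link cost such models compare (not the authors' formulation), one can contrast unicast replication, IP multicast, and ESM delivery to $N$ receivers as follows:

        % Illustrative only: a generic per-packet link-cost comparison,
        % not the model proposed in the paper.
        \[
          C_{\mathrm{unicast}} = \sum_{i=1}^{N} h_i,
          \qquad
          C_{\mathrm{IP\,multicast}} = \lvert T_{\mathrm{net}} \rvert,
          \qquad
          C_{\mathrm{ESM}} = \sum_{(u,v)\in T_{\mathrm{overlay}}} h_{uv},
        \]

    where $h_i$ is the unicast hop count from the source to receiver $i$, $\lvert T_{\mathrm{net}} \rvert$ is the number of links in the network-layer multicast tree, and $h_{uv}$ is the underlay hop count of overlay edge $(u,v)$. Typically $C_{\mathrm{IP\,multicast}} \le C_{\mathrm{ESM}} \le C_{\mathrm{unicast}}$, and it is this efficiency gap that such performance models quantify.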

    Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

    Neural networks (NNs) are growing in importance and complexity. A neural network's performance (and energy efficiency) can be bound by either computation or memory resources. The processing-in-memory (PIM) paradigm, where computation is placed near or within memory arrays, is a viable solution to accelerate memory-bound NNs. However, PIM architectures vary in form, and different PIM approaches lead to different trade-offs. Our goal is to analyze, discuss, and contrast DRAM-based PIM architectures for NN performance and energy efficiency. To do so, we analyze three state-of-the-art PIM architectures: (1) UPMEM, which integrates processors and DRAM arrays into a single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute bit-serial operations. Our analysis reveals that PIM greatly benefits memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when the GPU requires memory oversubscription for a general matrix-vector multiplication kernel; (2) Mensa improves energy efficiency and throughput by 3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3) SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude that the ideal PIM architecture for an NN model depends on the model's distinct attributes, due to the inherent design choices built into each architecture. Comment: This is an extended and updated version of a paper published in IEEE Micro, pp. 1-14, 29 Aug. 2022. arXiv admin note: text overlap with arXiv:2109.1432
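    A back-of-the-envelope arithmetic-intensity estimate (not taken from the paper) illustrates why a general matrix-vector multiplication (GEMV) kernel is memory-bound and therefore a natural fit for PIM:

        % Back-of-the-envelope estimate; illustrative numbers, not the paper's.
        \[
          y = A x,\quad A \in \mathbb{R}^{M \times N}
          \;\Rightarrow\;
          \mathrm{FLOPs} \approx 2MN,\qquad
          \mathrm{bytes\ moved} \gtrsim 4MN \ \ \text{(fp32 matrix, read once)},
        \]
        \[
          \mathrm{arithmetic\ intensity} \approx \frac{2MN}{4MN} = 0.5\ \mathrm{FLOP/byte}.
        \]

    A high-end GPU needs roughly ten or more FLOPs per byte of DRAM traffic to approach peak throughput, so GEMV sits deep in the memory-bound regime; moving the computation into or near DRAM attacks exactly this bottleneck, and the gap widens further when the matrix no longer fits in GPU memory and must be paged in (the oversubscription case cited above).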

    HoPP: Robust and Resilient Publish-Subscribe for an Information-Centric Internet of Things

    This paper revisits NDN deployment in the IoT with a special focus on the interaction of sensors and actuators. Such scenarios require high responsiveness and limited control state at the constrained nodes. We argue that the NDN request-response pattern, which prevents data push, is vital for IoT networks. We contribute HoP-and-Pull (HoPP), a robust publish-subscribe scheme for typical IoT scenarios that targets IoT networks consisting of hundreds of resource-constrained devices with intermittent connectivity. Our approach limits the FIB tables to a minimum and naturally supports mobility, temporary network partitioning, data aggregation, and near real-time reactivity. We experimentally evaluate the protocol in a real-world deployment using the IoT-Lab testbed with varying numbers of constrained devices, each wirelessly interconnected via IEEE 802.15.4 LoWPANs. Implementations are built on CCN-lite with RIOT and support experiments using various single- and multi-hop scenarios.
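    HoPP's message set is not detailed in this abstract. The C sketch below only illustrates the baseline NDN pull pattern the protocol builds on (Interests leave PIT breadcrumbs, Data is cached hop by hop in the content store); the structures and function names are hypothetical and are not CCN-lite's or HoPP's API.

        /* Hedged sketch of the NDN request-response (pull) pattern:
         * Interests travel toward data along FIB entries, leaving PIT
         * breadcrumbs; Data retraces the PIT and is cached hop by hop.
         * Names and structures are hypothetical, not CCN-lite's API. */
        #include <stdio.h>
        #include <string.h>

        #define MAX_ENTRIES 8

        typedef struct { char name[32]; char payload[32]; int valid; } cs_entry_t;   /* content store     */
        typedef struct { char name[32]; int requester_face; int valid; } pit_entry_t; /* pending interests */

        static cs_entry_t cs[MAX_ENTRIES];
        static pit_entry_t pit[MAX_ENTRIES];

        /* Interest arrives on `face`: answer from cache, or record it in the PIT
         * and (in a real node) forward it via the FIB toward the producer. */
        static void on_interest(const char *name, int face) {
            for (int i = 0; i < MAX_ENTRIES; i++)
                if (cs[i].valid && strcmp(cs[i].name, name) == 0) {
                    printf("face %d <- cached data for %s: %s\n", face, name, cs[i].payload);
                    return;
                }
            for (int i = 0; i < MAX_ENTRIES; i++)
                if (!pit[i].valid) {
                    pit[i].valid = 1;
                    pit[i].requester_face = face;
                    strncpy(pit[i].name, name, sizeof(pit[i].name) - 1);
                    printf("no cache hit, PIT entry added, pulling %s from upstream\n", name);
                    return;
                }
        }

        /* Data arrives: cache it, satisfy the matching PIT entry, consume the breadcrumb. */
        static void on_data(const char *name, const char *payload) {
            for (int i = 0; i < MAX_ENTRIES; i++)
                if (!cs[i].valid) {
                    cs[i].valid = 1;
                    strncpy(cs[i].name, name, sizeof(cs[i].name) - 1);
                    strncpy(cs[i].payload, payload, sizeof(cs[i].payload) - 1);
                    break;
                }
            for (int i = 0; i < MAX_ENTRIES; i++)
                if (pit[i].valid && strcmp(pit[i].name, name) == 0) {
                    printf("face %d <- data for %s: %s\n", pit[i].requester_face, name, payload);
                    pit[i].valid = 0;
                }
        }

        int main(void) {
            on_interest("/sensors/room1/temp", 3);       /* miss: pulled from upstream */
            on_data("/sensors/room1/temp", "21.5C");     /* data returns, gets cached  */
            on_interest("/sensors/room1/temp", 4);       /* second consumer hits cache */
            return 0;
        }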

    Architectural exploration and performance study of vertically-partially-connected 3D networks-on-chip

    Utilization of the third dimension can lead to a significant reduction in power and average hop count in Networks-on-Chip (NoC). TSV technology, the most promising technology for 3D integration, offers short and fast vertical links that cope with the long-wire problem of 2D NoCs. Nonetheless, TSVs are large and their manufacturing process is still immature, which reduces the yield of 3D-NoC-based SoCs. Vertically-Partially-Connected 3D-NoCs have therefore been introduced to benefit from 3D technology while preserving high yield. Such networks are also flexible, because the number, placement, and assignment of the vertical links in each layer can be decided based on the limitations and requirements of the design. However, the removed vertical links between layers make it challenging to build a feasible, high-performance Vertically-Partially-Connected Mesh-based 3D-NoC. This thesis addresses those challenges.
    Routing is the major problem of the Vertically-Partially-Connected 3D-NoC. Since some vertical links are removed, some routers lack up and/or down ports, so a routing algorithm must provide a path by which packets can still reach upper or lower layers, and the chosen paths must not cause deadlock in the network. To cope with this problem, we explain and evaluate a deterministic, deadlock- and livelock-free routing algorithm called Elevator First, which guarantees that whenever a path exists it will be found.
    Fundamentally, NoC performance is affected by both (1) the micro-architecture of the routers and (2) the architecture of the interconnect. The router micro-architecture has a significant effect on performance, as it contributes directly to the transport delay, so its simplicity and efficiency are critical, especially in a Vertically-Partially-Connected 3D-NoC, which already suffers from high average latency due to the removed vertical links. We present the design and implementation of a low-latency router micro-architecture that not only transfers packets correctly and quickly according to the Elevator First routing algorithm, but also consumes a reasonable amount of area and power.
    From the architectural point of view, the number and placement of the vertical links play a key role in the performance of a Vertically-Partially-Connected Mesh-based 3D-NoC, since they affect the average hop count and the link and buffer utilization in the network. Furthermore, the assignment of vertical links to routers that lack up and/or down port(s) strongly influences the performance of the 3D routers. The architectural exploration of such networks is therefore both important and non-trivial. We define, study, and evaluate parameters that describe the behavior of the network and help place and assign the vertical links in the layers effectively. Finally, we propose a quadratic-based estimation method to anticipate the saturation threshold of the network's average latency.
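    The abstract describes Elevator First only in outline. The C fragment below sketches the per-hop decision that description implies (use a local vertical port if present, otherwise route in-plane toward a pre-assigned elevator router); the types, fields, and elevator-assignment table are hypothetical, and the virtual-channel mechanism needed for deadlock freedom is omitted.

        /* Hedged sketch of an Elevator-First-style routing decision; types, fields
         * and the elevator-assignment table are hypothetical, and the deadlock-
         * avoidance machinery (virtual channels) is left out. */
        #include <stdlib.h>

        typedef enum { PORT_LOCAL, PORT_NORTH, PORT_SOUTH, PORT_EAST, PORT_WEST,
                       PORT_UP, PORT_DOWN } port_t;

        typedef struct { int x, y, z; } coord_t;

        typedef struct {
            coord_t pos;           /* this router's coordinates              */
            int has_up, has_down;  /* vertical ports present at this router? */
            coord_t elevator_up;   /* assigned elevator for upward traffic   */
            coord_t elevator_down; /* assigned elevator for downward traffic */
        } router_t;

        /* Dimension-ordered (XY) step toward a target within the current layer. */
        static port_t xy_step(coord_t here, coord_t target) {
            if (target.x > here.x) return PORT_EAST;
            if (target.x < here.x) return PORT_WEST;
            if (target.y > here.y) return PORT_NORTH;
            if (target.y < here.y) return PORT_SOUTH;
            return PORT_LOCAL;
        }

        static port_t route(const router_t *r, coord_t dst) {
            if (dst.z == r->pos.z)                       /* same layer: plain XY */
                return xy_step(r->pos, dst);

            int going_up = dst.z > r->pos.z;
            if (going_up && r->has_up)    return PORT_UP;    /* take the local   */
            if (!going_up && r->has_down) return PORT_DOWN;  /* vertical link    */

            /* No suitable vertical port here: head for the assigned elevator. */
            coord_t elev = going_up ? r->elevator_up : r->elevator_down;
            return xy_step(r->pos, elev);
        }

        int main(void) {
            router_t r = { .pos = {2, 1, 0}, .has_up = 0, .has_down = 1,
                           .elevator_up = {0, 1, 0}, .elevator_down = {2, 1, 0} };
            coord_t dst = { 3, 3, 2 };              /* destination two layers up */
            return route(&r, dst) == PORT_WEST ? EXIT_SUCCESS : EXIT_FAILURE;
        }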

    Learning algorithms for the control of routing in integrated service communication networks

    There is a high degree of uncertainty regarding the nature of traffic on future integrated service networks. This uncertainty motivates the use of adaptive resource allocation policies that can take advantage of the statistical fluctuations in the traffic demands. The adaptive control mechanisms must be 'lightweight', in terms of their overheads, and scale to potentially large networks with many traffic flows. Adaptive routing is one form of adaptive resource allocation, and this thesis considers the application of Stochastic Learning Automata (SLA) for distributed, lightweight adaptive routing in future integrated service communication networks.
    The thesis begins with a broad critical review of the use of Artificial Intelligence (AI) techniques applied to the control of communication networks. Detailed simulation models of integrated service networks are then constructed, and learning automata based routing is compared with traditional techniques on large-scale networks. Learning automata are examined for the 'Quality-of-Service' (QoS) routing problem in realistic network topologies, where flows may be routed in the network subject to multiple QoS metrics, such as bandwidth and delay. It is found that learning automata based routing gives considerable blocking-probability improvements over shortest-path routing, despite only using local connectivity information and a simple probabilistic updating strategy.
    Furthermore, automata are considered for routing in more complex environments, spanning issues such as multi-rate traffic, trunk reservation, routing over multiple domains, routing in high bandwidth-delay product networks, and the use of learning automata as a background learning process. Automata are also examined for routing of both 'real-time' and 'non-real-time' traffic in an integrated traffic environment, where the non-real-time traffic has access to the bandwidth 'left over' by the real-time traffic. It is found that adopting learning automata for the routing of the real-time traffic may improve the performance of both real-time and non-real-time traffic under certain conditions. In addition, it is found that one set of learning automata may route both traffic types satisfactorily.
    Automata are considered for the routing of multicast connections in receiver-oriented, dynamic environments, where receivers may join and leave the multicast sessions dynamically. Automata are shown to be able to minimise the average delay or the total cost of the resulting trees using the appropriate feedback from the environment. Automata provide a distributed solution to the dynamic multicast problem, requiring purely local connectivity information and a simple updating strategy. Finally, automata are considered for the routing of multicast connections that require QoS guarantees, again in receiver-oriented dynamic environments. It is found that the distributed application of learning automata leads to considerably lower blocking probabilities than a shortest-path-tree approach, due to a combination of load-balancing and minimum-cost behaviour.
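    The abstract mentions 'a simple probabilistic updating strategy' without spelling it out. The C sketch below shows the standard linear reward-inaction (L_R-I) update often used with stochastic learning automata for next-hop selection; it illustrates the general approach, not necessarily the exact scheme used in the thesis.

        /* Hedged sketch: linear reward-inaction (L_R-I) update for choosing among
         * candidate next hops. This is the textbook SLA scheme, shown for
         * illustration; the thesis's exact update rule may differ. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N_ACTIONS 3          /* candidate next-hop links */
        #define LEARNING_RATE 0.1

        /* Sample an action (next hop) according to the current probability vector. */
        static int choose_action(const double p[N_ACTIONS]) {
            double r = (double)rand() / RAND_MAX, acc = 0.0;
            for (int i = 0; i < N_ACTIONS; i++) {
                acc += p[i];
                if (r <= acc) return i;
            }
            return N_ACTIONS - 1;
        }

        /* L_R-I: on success (e.g. call admitted), shift probability mass toward the
         * chosen action; on failure (e.g. call blocked), leave probabilities alone. */
        static void update(double p[N_ACTIONS], int chosen, int success) {
            if (!success) return;
            for (int i = 0; i < N_ACTIONS; i++)
                p[i] = (i == chosen) ? p[i] + LEARNING_RATE * (1.0 - p[i])
                                     : p[i] * (1.0 - LEARNING_RATE);
        }

        int main(void) {
            double p[N_ACTIONS] = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
            for (int t = 0; t < 1000; t++) {
                int a = choose_action(p);
                int success = (a == 1) ? (rand() % 100 < 90)   /* link 1 rarely blocks */
                                       : (rand() % 100 < 50);  /* others block often   */
                update(p, a, success);
            }
            printf("learned next-hop probabilities: %.2f %.2f %.2f\n", p[0], p[1], p[2]);
            return 0;
        }

    The update only needs local feedback (was the call on the chosen outgoing link accepted or blocked?) and a per-node probability vector, which is what makes the approach lightweight and fully distributed.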