
574 research outputs found

General hardware multicasting for fine-grained message-passing architectures

Author: Shane Fleming
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2021

Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model.

Cronfa at Swansea University

General hardware multicasting for fine-grained message-passing architectures

Authors: Beaumont JR; Brown A; Bytheway T; Fleming S; Markettos AT; Moore SW; Naylor M; Thomas D; Vousden M
Publication venue: Proceedings - 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2021
Publication date: 01/01/2021


Southampton (e-Prints Soton)

Apollo (Cambridge)

Distributed k-ary System: Algorithms for Distributed Hash Tables

Author: Ghodsi Ali
Publication date: 01/01/2006

This dissertation presents algorithms for data structures called distributed hash tables (DHT) or structured overlay networks, which are used to build scalable self-managing distributed systems. The provided algorithms guarantee lookup consistency in the presence of dynamism: they guarantee consistent lookup results in the presence of nodes joining and leaving. Similarly, the algorithms guarantee that routing never fails while nodes join and leave. Previous algorithms for lookup consistency either suffer from starvation, do not work in the presence of failures, or lack a proof of correctness. Several group communication algorithms for structured overlay networks are presented. We provide an overlay broadcast algorithm, which unlike previous algorithms avoids redundant messages, reaching all nodes in O(log n) time, while using O(n) messages, where n is the number of nodes in the system. The broadcast algorithm is used to build overlay multicast. We introduce bulk operation, which enables a node to efficiently make multiple lookups or send a message to all nodes in a specified set of identifiers. The algorithm ensures that all specified nodes are reached in O(log n) time, sending at most O(log n) messages per node, regardless of the input size of the bulk operation. Moreover, the algorithm avoids sending redundant messages. Previous approaches required multiple lookups, which consume more messages and can render the initiator a bottleneck. Our algorithms are used in DHT-based storage systems, where nodes can do thousands of lookups to fetch large files. We use the bulk operation algorithm to construct a pseudo-reliable broadcast algorithm. Bulk operations can also be used to implement efficient range queries. Finally, we describe a novel way to place replicas in a DHT, called symmetric replication, that enables parallel recursive lookups. Parallel lookups are known to reduce latencies. However, costly iterative lookups have previously been used to do parallel lookups. Moreover, joins or leaves only require exchanging O(1) messages, while other schemes require at least log(f) messages for a replication degree of f. The algorithms have been implemented in a middleware called the Distributed k-ary System (DKS), which is briefly described.
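
As an illustration of how a structured overlay can reach every node in O(log n) hops without redundant messages, the following minimal sketch simulates a limit-based broadcast on a fully populated ring of 2^m identifiers with power-of-two fingers. The ring layout, the `broadcast` function and its `n_bits` parameter are assumptions chosen for illustration. This is not the DKS implementation, and it ignores joins, leaves and failures.

```python
def broadcast(n_bits):
    """Simulate a broadcast from node 0 on a full ring of 2**n_bits nodes.

    Hypothetical model: node x has fingers x + 2**i for i in [0, n_bits).
    Each message carries a limit; the receiver is responsible only for
    identifiers strictly between itself and that limit, so no node is
    ever contacted twice.
    """
    n = 1 << n_bits
    hops_to = {0: 0}          # node id -> hop count at which it was reached
    messages = 0
    pending = [(0, n, 0)]     # (node, limit, hops): node covers (node, limit)
    while pending:
        node, limit, hops = pending.pop()
        # Visit fingers from farthest to nearest, shrinking the limit each time.
        for i in reversed(range(n_bits)):
            finger = node + (1 << i)
            if finger < limit:
                if finger in hops_to:
                    raise RuntimeError("redundant message to node %d" % finger)
                hops_to[finger] = hops + 1
                messages += 1
                pending.append((finger, limit, hops + 1))
                limit = finger
    return hops_to, messages


if __name__ == "__main__":
    reached, sent = broadcast(4)
    print(len(reached), sent, max(reached.values()))
```

In this idealised setting a ring of 16 nodes is covered with 15 messages (one per non-initiator node) and a maximum depth of 4 hops, which matches the O(n) message and O(log n) time bounds quoted in the abstract.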

Publikationer från KTH

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Study of the Topology Mismatch Problem in Peer-to-Peer Networks

Authors: Lalitha B; Rao Dr Ch D V Subba
Publication venue: The International Institute for Science, Technology and Education (IISTE)
Publication date: 31/07/2012

Peer-to-peer (P2P) technology offers many advantages over other systems such as distributed messaging systems, the client-server model, and cloud-based systems; its main advantages include, but are not limited to, high scalability and low cost. However, P2P systems suffer from a bottleneck caused by topology mismatch. Topology mismatch occurs in an unstructured P2P network when the participating peers choose their neighbors at random, so that the resulting overlay does not match the underlying physical network. This leads to lengthy communication between peers and to redundant traffic in the underlying network [1]. Most P2P systems suffer in performance from this mismatch between the overlay topology and the underlying physical network topology, which generates a large volume of redundant Internet traffic and slows performance. This paper surveys P2P topology mismatch problems and the solutions adopted for different applications.

International Institute for Science, Technology and Education (IISTE): E-Journals

Structured P2P Technologies for Distributed Command and Control

Authors: Karrels Daniel R.; Mullins Barry E.; Peterson Gilbert L.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/12/2009

The utility of Peer-to-Peer (P2P) systems extends far beyond traditional file sharing. This paper provides an overview of how P2P systems are capable of providing robust command and control for Distributed Multi-Agent Systems (DMASs). Specifically, this article presents the evolution of P2P architectures to date by discussing supporting technologies and applicability of each generation of P2P systems. It provides a detailed survey of fundamental design approaches found in modern large-scale P2P systems highlighting design considerations for building and deploying scalable P2P applications. The survey includes unstructured P2P systems, content retrieval systems, communications structured P2P systems, flat structured P2P systems and finally Hierarchical Peer-to-Peer (HP2P) overlays. It concludes with a presentation of design tradeoffs and opportunities for future research into P2P overlay systems

AFTI Scholar (Air Force Institute of Technology)

7. GI/ITG KuVS Fachgespräch Drahtlose Sensornetze

Authors: Ritter Hartmut; Schiller Jochen; Terfloth Kirsten; Wittenburg Georg
Publication date: 01/01/2008

This proceedings volume collects the contributions to the Fachgespräch Drahtlose Sensornetze 2008 (expert meeting on wireless sensor networks). The aim of the meeting is to give researchers in this field an opportunity for informal exchange; participants from industrial research are always welcome, and are taking part again this year. The Fachgespräch is a deliberately informal event of the GI/ITG special interest group "Kommunikation und Verteilte Systeme" (www.kuvs.de). It is expressly not yet another conference with its large overhead and the requirement to present finished and, as far as possible, "watertight" results. Rather, it also explicitly serves to hold discussions with newcomers who are still looking for their topic and to find out where the challenges for future research actually lie. The Fachgespräch Drahtlose Sensornetze 2008 takes place in Berlin, on the premises of the Freie Universität Berlin, but in cooperation with ScatterWeb GmbH. This too is a first, and it shows that the Fachgespräch is considerably more than just a pleasant get-together under a common theme. Thanks for organizing the overall setting and the evening event are due to the two members of the organizing committee, Kirsten Terfloth and Georg Wittenburg, and also to Stefanie Bahe, who handled the editorial work on the proceedings, to many other members of the AG Technische Informatik at FU Berlin, and of course to its head, Prof. Jochen Schiller.

Institutional Repository of the Freie Universität Berlin

Managing HBM’s bandwidth in Multi-Die FPGAs using Overlay NoCs

Author: Kuttuva Prakash Srinirdheeshwar
Publication venue: University of Waterloo
Publication date: 07/01/2022

We can improve HBM bandwidth distribution and utilization on a multi-die FPGA like Xilinx Alveo U280 by using Overlay Network-on-Chips (NoCs). HBM in Xilinx Alveo U280 offers 8 GB of memory capacity with a theoretical maximum bandwidth of 460 GBps, but all thirty-two HBM ports in Xilinx Alveo U280 are exposed to the FPGA fabric in only one die. As a result, processing elements assigned to other dies must use the scarce and challenging-to-use Super Long Lines (SLLs) to access the HBM's bandwidth. Furthermore, HBM is fractured internally into thirty-two smaller memories called pseudo channels. They are connected together by a hardened and flawed cross-bar, which enables global accesses from any of the HBM ports but introduces several throughput bottlenecks, degrading the achievable throughput when the entire memory space is used. An Overlay Hybrid NoC combining the features of Hoplite and Butterfly Fat Tree (BFT) NoCs offers a high-frequency solution for distributing HBM's bandwidth across all three dies, as well as overcoming the throughput bottleneck introduced by the internal cross-bar. The Hybrid NoC combines multiple high-frequency Ring NoCs for inter-die communication and Butterfly Fat Tree NoCs for intra-die communication. In addition, the routing capability of the NoC can be modified to supplant the HBM's internal cross-bar for global accesses. We demonstrate this on Xilinx Alveo U280 using synthetic benchmarks and two application-based benchmarks, Dense Matrix-Matrix multiplication (DMM) and Sparse Matrix-Vector multiplication (SpMV). Our experiments show that NoCs can improve throughput utilization by as much as ×8.6 for single-flit global accesses, ×1.7 for multi-flit global accesses with burst length 16, and as much as ×1.4 for the SpMV benchmark.
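
To put the headline figures in perspective, the short calculation below divides the quoted capacity and theoretical peak bandwidth evenly across the thirty-two pseudo channels. The even split and the helper name `even_share` are assumptions made for illustration; the abstract gives only the totals, and achievable per-channel throughput will differ in practice.

```python
# Figures quoted in the abstract for the HBM on Xilinx Alveo U280.
HBM_CAPACITY_GB = 8
HBM_PEAK_BANDWIDTH_GBPS = 460
PSEUDO_CHANNELS = 32


def even_share(total, parts=PSEUDO_CHANNELS):
    """Return one part of `total` under an assumed perfectly even split."""
    return total / parts


if __name__ == "__main__":
    # Per-pseudo-channel share of the theoretical peak bandwidth.
    print(even_share(HBM_PEAK_BANDWIDTH_GBPS), "GBps per pseudo channel")
    # Per-pseudo-channel share of the memory capacity.
    print(even_share(HBM_CAPACITY_GB), "GB per pseudo channel")
```

Under that assumption each pseudo channel accounts for 14.375 GBps of peak bandwidth and 0.25 GB of capacity, which is why a processing element that reaches only a few ports sees a small fraction of the total.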

University of Waterloo's Institutional Repository

C-MOS array design techniques: SUMC multiprocessor system study

Authors: Clapp W. A.; Helbig W. A.; Merriam A. S.

The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four I/O processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through I/O processors, and multiport access to multiple and shared memory units.

NASA Technical Reports Server

Kestrel: Job Distribution and Scheduling using XMPP

Author: Stout Lance
Publication venue: Clemson University Libraries
Publication date: 01/05/2011

A new distributed computing framework, named Kestrel, for Many-Task Computing (MTC) applications and implementing Virtual Organization Clusters (VOCs) is proposed. Kestrel is a lightweight, highly available system based on the Extensible Messaging and Presence Protocol (XMPP), and has been developed to explore XMPP-based techniques for improving MTC and VOC tolerance to faults due to scaling and intermittently connected heterogeneous resources. Kestrel provides a VOC with a special-purpose scheduler which can provide better scalability under certain workload assumptions, namely CPU-bound processes and bag-of-task applications. Experimental results have shown that Kestrel is capable of operating a VOC of at least 1600 worker nodes with all nodes visible to the scheduler at once. When using multiple sites located in both North America and Europe, the latencies introduced to the round trip time of messages were on the order of 0.3 seconds. To offset the overhead of XMPP processing, a task execution time of 2 seconds is sufficient for a pool of 900 workers on a single site to operate at near 100% use. Requiring tasks that take on the order of 30 seconds to a minute to execute would compensate for increased latency during job dispatch across multiple sites. Kestrel's architecture is rooted in pilot job frameworks heavily used in Grid computing; it is also modeled after the use of IRC by botnets to communicate between compromised machines and command and control servers. For Kestrel, the extensibility of XMPP has allowed development of protocols for identifying manager nodes, discovering the capabilities of worker agents, and distributing tasks. The presence notifications provided by XMPP allow Kestrel to monitor the global state of the pool and to perform task dispatching based on worker availability. In this work it is argued that XMPP is by design a very good fit for cloud computing frameworks. It offers scalability, federation between servers and some autonomicity of the agents. During the summer of 2010, Kestrel was used and modified based on feedback from the STAR group at Brookhaven National Laboratory. STAR provided a virtual machine image with applications for simulating proton collisions using PYTHIA and GEANT3. A Kestrel-based virtual organization cluster, created on top of Clemson University's Palmetto cluster, was able to provide over 400,000 CPU hours of computation over the course of a month using an average of 800 virtual machine instances every day, generating nearly seven terabytes of data and the largest PYTHIA production run that STAR ever achieved. Several architectural issues were encountered during the course of the experiment and were resolved by moving from the original JSON protocols used by Kestrel to native XMPP equivalents that offered better message delivery confirmation and integration with existing tools.

Clemson University: TigerPrints