APEnet+: high bandwidth 3D torus direct network for petaflops scale commodity clusters

Author: A Biagioni
A Lonardo
A Salamon
Ammendola R
Bodin F
Chalasani Suresh
D Rossetti
F Lo Cicero
F Simula
G Salina
L Tosoratto
NVIDIA Corporation
O Prezza
P S Paolucci
P Vicini
Paolucci P S
R Ammendola
Publication venue: 'IOP Publishing'
Publication date: 18/02/2011

We describe herein the APElink+ board, a PCIe interconnect adapter featuring the latest advances in wire speed and interface technology, plus hardware support for an RDMA programming model and experimental acceleration of GPU networking. This design allows us to build a low-latency, high-bandwidth PC cluster, the APEnet+ network, the new generation of our cost-effective cluster network architecture, scalable to tens of thousands of nodes. Some test results and a characterization of data transmission on a complete testbench, based on a commercial development card mounting an Altera FPGA, are provided. Comment: 6 pages, 7 figures, proceeding of CHEP 2010, Taiwan, October 18-2

arXiv.org e-Print Archive

Crossref

APEnet+: a 3D toroidal network enabling Petaflops scale Lattice QCD simulations on commodity clusters

Author: Ammendola Roberto
Biagioni Andrea
Cicero Francesca Lo
Frezza Ottorino
Lonardo Alessandro
Paolucci Pier
Petronzio Roberto
Rossetti Davide
Salamon Andrea
Salina Gaetano
Simula Francesco
Tantalo Nazario
Tosoratto Laura
Vicini Piero
Publication venue
Publication date: 01/01/2010

Many scientific computations need multi-node parallelism to meet ever-increasing requirements in both space (memory) and time (speed). The use of GPUs as accelerators introduces yet another level of complexity for the programmer and may result in large overheads due to the complex memory hierarchy. Additionally, top-notch problems may easily employ more than a Petaflops of sustained computing power, requiring thousands of GPUs orchestrated with some parallel programming model. Here we describe APEnet+, the new generation of our interconnect, which scales up to tens of thousands of nodes with linear cost, thus improving the price/performance ratio on large clusters. The project target is the development of the APElink+ host adapter, featuring a low-latency, high-bandwidth direct network, state-of-the-art wire speeds on the links and a PCIe x8 Gen2 host interface. It features hardware support for the RDMA programming model and experimental acceleration of GPU networking. A Linux kernel driver, a set of low-level RDMA APIs and an OpenMPI library driver are available, allowing for painless porting of standard applications. Finally, we give an insight into future work and intended developments.

arXiv.org e-Print Archive

ART
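
The two APEnet+ entries above name a 3D torus direct network. As a minimal sketch of what that topology implies, and not the APEnet+ routing logic itself, the Python below computes the six nearest neighbours of a node and the minimal hop count between two nodes. The function names `torus_neighbours` and `torus_hops` and the example lattice size are assumptions made for illustration only.

```python
def torus_neighbours(node, dims):
    """Return the six nearest neighbours of `node` on a 3D torus.

    `node` is an (x, y, z) coordinate and `dims` the (nx, ny, nz) lattice
    size. Coordinates wrap around at the edges, which is what makes the
    mesh a torus.
    """
    x, y, z = node
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


def torus_hops(a, b, dims):
    """Minimal hop count between nodes a and b.

    On each axis the packet can travel either way round the ring, so the
    distance is the shorter of the two directions.
    """
    return sum(min((p - q) % n, (q - p) % n) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical 64-node lattice
    print(torus_neighbours((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 2, 1), dims))
```

Every node has the same fixed number of links regardless of lattice size, which is one plausible reading of the "linear cost" scaling mentioned in the abstract; the abstract itself does not spell out the cost model.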

NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features

Author: Ameli F.
Ammendola R.
Biagioni A.
Frezza O.
Lamanna G.
Lo Cicero F.
Lonardo A.
Martinelli M.
Nicolau C.
Paolucci P.S.
Pastorelli E.
Pontisso L.
Rossetti D.
Simeone F.
Simula F.
Sozzi M.
Tosoratto L.
Vicini P.
Publication venue
Publication date: 13/06/2014

While the GPGPU paradigm is widely recognized as an effective approach to high performance computing, its adoption in low-latency, real-time systems is still in its early stages. Although GPUs typically show deterministic behaviour in terms of latency in executing computational kernels as soon as data is available in their internal memories, assessing the real-time features of a standard GPGPU system requires careful characterization of all subsystems along the data stream path. The networking subsystem turns out to be the most critical one in terms of the absolute value and fluctuations of its response latency. Our envisioned solution to this issue is NaNet, an FPGA-based PCIe Network Interface Card (NIC) design featuring a configurable and extensible set of network channels with direct access through GPUDirect to NVIDIA Fermi/Kepler GPU memories. The NaNet design currently supports both standard channels, GbE (1000BASE-T) and 10GbE (10GBASE-R), and custom channels, 34 Gbps APElink and 2.5 Gbps deterministic-latency KM3link, but its modularity allows for a straightforward inclusion of other link technologies. To avoid host OS intervention on the data stream and remove a possible source of jitter, the design includes a network/transport layer offload module with cycle-accurate, upper-bound latency, supporting UDP, KM3link Time Division Multiplexing and APElink protocols. After a description of the NaNet architecture and its latency/bandwidth characterization for all supported links, two real-world use cases will be presented: the GPU-based low-level trigger for the RICH detector in the NA62 experiment at CERN and the on-/off-shore data link for the KM3 underwater neutrino telescope.

arXiv.org e-Print Archive

CERN Document Server

CRoute: a fast high-quality timing-driven connection-based FPGA router

Author: Stroobandt Dirk
Vansteenkiste Elias
Vercruyce Dries
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019

FPGA routing is an important part of physical design, as the programmable interconnection network requires the majority of the total silicon area and the connections largely contribute to delay and power. It should also occur with minimum runtime to enable efficient design exploration. In this work we elaborate on the concept of the connection-based routing principle. The algorithm is improved and a timing-driven version is introduced. The router, called CROUTE, is implemented in an easy-to-adapt FPGA CAD framework written in Java, which is publicly available on GitHub. Quality and runtime are compared to the state-of-the-art router in VPR 7.0.7. Benchmarking is done with the TITAN23 design suite, which consists of large heterogeneous designs targeted to a detailed representation of the Stratix IV FPGA. CROUTE gains in both the total wirelength and maximum clock frequency while reducing the routing runtime. The total wirelength is reduced by 11% and the maximum clock frequency increases by 6%. These high-quality results are obtained in 3.4x less routing runtime.

Crossref

Ghent University Academic Bibliography

High throughput spatial convolution filters on FPGAs

Author: Al-Dujaili Abdullah
Fahmy Suhaib A.
Ioannou Lenos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/04/2020

Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations, which can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks, such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility.

Warwick Research Archives Portal Repository
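
To make the spatial convolution filter in the entry above concrete, here is a minimal software reference model in Python. It is a sketch under stated assumptions, not the authors' FPGA architecture: the function names `convolve2d` and `pixel_rate` are hypothetical, the window slides without flipping the kernel (as is common in image filtering), borders are not padded, and 3840x2160 is assumed as the "4K" frame size, which the abstract does not state.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the filtered output.

    Both arguments are lists of rows. Each output pixel is one
    multiply-accumulate per coefficient, so the coefficients can be
    changed freely without touching the structure of the loop.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out


def pixel_rate(width, height, fps):
    """Pixels per second a filter must sustain at the given frame rate."""
    return width * height * fps


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 0], [0, 1]]
    print(convolve2d(image, kernel))
    print(pixel_rate(3840, 2160, 60))  # assumed UHD frame size
```

Under the assumed frame size, 30 to 60 FPS corresponds to roughly 249 to 498 million output pixels per second, each needing one multiply-accumulate per kernel coefficient. That gives a sense of why the mapping onto hard DSP blocks matters.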

On-Chip Communication and Security in FPGAs

Author: Patil Shivukumar Basanagouda
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018

Innovations in Field Programmable Gate Array (FPGA) manufacturing processes and architectural design have led to the development of extremely large FPGAs. There has also been widespread adoption of these large FPGAs in cloud infrastructures and data centers to accelerate search and machine learning applications. Two important topics related to FPGAs are addressed in this work: on-chip communication and security. On-chip communication is quickly becoming a bottleneck in today's large multi-million gate FPGAs. Hard Networks-on-Chip (NoC), made of fixed silicon, have been shown to provide low-power, high-speed, flexible on-chip communication. An iterative algorithm for routing pre-scheduled time-division-multiplexed paths in a hybrid NoC FPGA architecture is demonstrated in this thesis work. The routing algorithm is based on the well-known Pathfinder algorithm; it overcomes several limitations of a previous greedy implementation and successfully routes connections using a higher number of timeslots than greedy approaches. The new algorithm shows an average bandwidth improvement of 11% for unicast and multicast traffic patterns. Regarding on-chip FPGA security, a recent study on covert channel communication in Xilinx FPGA devices has shown information leaking from long interconnect wires into immediately neighboring wires. This information leakage can be used by an attacker in a multi-tenant FPGA cloud infrastructure to non-invasively steal secret information from an unsuspecting user design. It is demonstrated that the information leakage is also present in Intel SRAM FPGAs. Information leakage in Cyclone-IV E and Stratix-V FPGA devices is quantified and characterized with varying parameters, and across different routing elements of the FPGAs.

ScholarWorks@UMass Amherst