Network acceleration techniques
Splintered offloading techniques with receive batch processing are described for network acceleration. Such techniques offload specific functionality to a NIC while keeping the bulk of the protocol processing in the host operating system ("OS"). The resulting protocol implementation allows the application to bypass the protocol processing of the received data. This is accomplished by moving data from the NIC directly to the application through direct memory access ("DMA") and batch-processing the receive headers in the host OS when the host OS is interrupted to perform other work. Batch processing the receive headers allows the data path to be separated from the control path. Unlike operating-system bypass, however, the operating system still fully manages the network resource and retains relevant feedback about traffic and flows. Embodiments of the present disclosure can therefore address the challenges of networks with extreme bandwidth delay products (BWDP).
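The data-path/control-path split described above can be sketched as a toy simulation. This is an illustrative model only, not a real NIC driver API; all class and attribute names are hypothetical.

```python
# Hedged sketch of splintered offloading: payloads go straight to the
# application ("DMA" data path), while headers are queued and
# batch-processed by the host OS on its next interrupt (control path).
from collections import deque

class SplinteredNIC:
    def __init__(self):
        self.header_queue = deque()   # control path: deferred headers
        self.app_buffer = []          # data path: direct "DMA" target

    def receive(self, header, payload):
        self.app_buffer.append(payload)   # data bypasses OS protocol stack
        self.header_queue.append(header)  # header deferred for batching

class HostOS:
    def __init__(self):
        self.flow_stats = {}  # OS keeps full visibility of traffic

    def on_interrupt(self, nic):
        # Batch-process all pending headers in one pass, amortizing
        # per-packet protocol costs over the whole batch.
        batch = list(nic.header_queue)
        nic.header_queue.clear()
        for hdr in batch:
            flow = hdr["flow_id"]
            self.flow_stats[flow] = self.flow_stats.get(flow, 0) + hdr["len"]
        return len(batch)

nic, host = SplinteredNIC(), HostOS()
for _ in range(4):
    nic.receive({"flow_id": "f1", "len": 1500}, b"x" * 1500)
processed = host.on_interrupt(nic)
```

Because the OS still sees every header, it retains the traffic and flow feedback that full kernel bypass would sacrifice.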
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
Graph Neural Networks (GNNs) are emerging as a powerful tool for learning
from graph-structured data and performing sophisticated inference tasks in
various application domains. Although GNNs have been shown to be effective on
modest-sized graphs, training them on large-scale graphs remains a significant
challenge due to lack of efficient data access and data movement methods.
Existing frameworks for training GNNs use CPUs for graph sampling and feature
aggregation, while the training and updating of model weights are executed on
GPUs. However, our in-depth profiling shows that the CPUs cannot achieve the
throughput required to saturate GNN model training, causing gross
under-utilization of expensive GPU resources. Furthermore, when the graph and
its embeddings do not fit in CPU memory, the overhead introduced by the
operating system, e.g., for handling page faults, falls on the critical path
of execution.
To address these issues, we propose the GPU Initiated Direct Storage Access
(GIDS) dataloader, to enable GPU-oriented GNN training for large-scale graphs
while efficiently utilizing all hardware resources, such as CPU memory,
storage, and GPU memory with a hybrid data placement strategy. By enabling GPU
threads to fetch feature vectors directly from storage, GIDS dataloader solves
the memory capacity problem for GPU-oriented GNN training. Moreover, GIDS
dataloader leverages GPU parallelism to tolerate storage latency and eliminates
expensive page-fault overhead. Doing so enables us to design novel
optimizations for exploiting locality and increasing effective bandwidth for
GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets
shows that GIDS dataloader accelerates the overall DGL GNN training pipeline by
up to 392X when compared to the current state-of-the-art DGL dataloader.
Comment: Under Submission. Source code:
https://github.com/jeongminpark417/GID
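The hybrid data placement idea in the abstract above can be illustrated with a toy dataloader: hot feature vectors are cached in host memory, and misses fall through to (simulated) storage. This is a minimal sketch under stated assumptions; the class and method names are hypothetical and not the actual GIDS API.

```python
# Illustrative sketch of hybrid placement for GNN feature fetching:
# hot node features live in a host-memory cache, cold features are
# read from storage (here a plain dict standing in for an SSD).
class HybridDataloader:
    def __init__(self, storage, hot_ids):
        self.storage = storage                          # simulated SSD
        self.cache = {i: storage[i] for i in hot_ids}   # host-memory cache
        self.storage_reads = 0

    def fetch(self, node_ids):
        feats = []
        for i in node_ids:
            if i in self.cache:           # locality hit: no storage access
                feats.append(self.cache[i])
            else:                         # miss: direct storage read
                self.storage_reads += 1
                feats.append(self.storage[i])
        return feats

storage = {i: [float(i)] * 4 for i in range(100)}
dl = HybridDataloader(storage, hot_ids=range(10))
batch = dl.fetch([1, 2, 50, 51])
```

In the real system the misses would be issued by many GPU threads in parallel, so storage latency is overlapped rather than serialized; the toy model only captures the placement decision.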
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
We implemented the NaNet FPGA-based PCIe Gen2 GbE/APElink NIC, featuring
GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is
able to receive a UDP input data stream from its GbE interface and redirect it,
without any intermediate buffering or CPU intervention, to the memory of a
Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices
share the same upstream root complex. Synthetic benchmarks for latency and
bandwidth are presented. We describe how NaNet can be employed in the prototype
of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment,
to implement the data link between the TEL62 readout boards and the low level
trigger processor. Results for the throughput and latency of the integrated
system are presented and discussed.
Comment: Proceedings for the 20th International Conference on Computing in
High Energy and Nuclear Physics (CHEP
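A rough host-side analogy for NaNet's copy-avoiding receive path (not the hardware mechanism itself): receiving a UDP datagram directly into a preallocated buffer with `recv_into()`, much as NaNet steers datagrams into GPU memory without staging them in intermediate host buffers. The sketch uses loopback sockets purely for illustration.

```python
# Receive a UDP datagram directly into a preallocated buffer,
# avoiding an intermediate per-packet allocation/copy at this layer.
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))                 # ephemeral loopback port
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"event-data", rx.getsockname())

buf = bytearray(2048)      # preallocated "device" buffer
n = rx.recv_into(buf)      # payload lands directly in buf
payload = bytes(buf[:n])

rx.close()
tx.close()
```

The actual NIC goes further: GPUDirect RDMA lets the FPGA write into GPU memory with no CPU involvement at all, which no socket API can reproduce.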
Network Vortex: Distributed Virtual Memory for Streaming Applications
Explosive growth of the Internet, cluster computing, and storage technology has led to the generation of enormous volumes of information and the need for scalable data computing. One of the central frameworks for such analysis is MapReduce, a programming platform for processing streaming data in external/distributed memory. Despite a significant public effort, open-source implementations of MapReduce (e.g., Hadoop, Spark) are complicated, bulky, and inefficient. To overcome this problem, we explore employing and expanding upon a recent C/C++ programming abstraction called Vortex that offers a simple interface to the user, zero-copy operation, low RAM consumption, and high data throughput. In particular, this research examines algorithms and techniques for enabling Vortex operation over the network, including both TCP/IP sockets and data-link RDMA (e.g., InfiniBand) interfaces. We developed a new producer-consumer memory-stream abstraction presented as a Vortex stream split across two hosts. The stream travels through a hidden network communication layer that provides the illusion of writing a continuous stream of data directly into a window of memory on a remote machine. This enables high-performance networking code and size-agnostic data transport, written as simply as an in-memory copy operation, overcoming the complications normally inherent in the discrete nature of network packet transfer. While the resulting product is highly workable over standard IP-based networks, the design limitations of RDMA technology in interfacing with virtual memory make Vortex streams a suboptimal abstraction for this programming platform, as its central appeal of zero-copy network transfer is rendered largely inaccessible. Alternative algorithms to enhance RDMA performance with Vortex are proposed for future study.
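The stream-split idea above can be modeled in miniature: the producer writes through a file-like object that looks like appending to memory but is backed by a socket, and the consumer on the other side reads a continuous byte stream. This is a hedged sketch, not the Vortex API; `NetStream` and its methods are illustrative names.

```python
# Minimal producer-consumer "memory stream" split across two endpoints
# (a socketpair stands in for the two hosts). Writes look like copies
# into memory; the network layer is hidden behind the write() call.
import socket

class NetStream:
    """Producer side: file-like writes, actually a hidden socket."""
    def __init__(self, sock):
        self.sock = sock

    def write(self, data):
        self.sock.sendall(data)   # hidden communication layer

    def close(self):
        self.sock.shutdown(socket.SHUT_WR)  # signal end-of-stream

producer_sock, consumer_sock = socket.socketpair()
stream = NetStream(producer_sock)
for chunk in (b"map-", b"reduce-", b"records"):
    stream.write(chunk)           # size-agnostic, copy-like interface
stream.close()

received = b""                    # consumer reassembles the byte stream
while True:
    part = consumer_sock.recv(4096)
    if not part:
        break
    received += part
```

The consumer sees one continuous stream regardless of how the producer's writes were chunked, which is exactly the packet-boundary abstraction the thesis describes.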
Disk-to-Disk Data Transfer using A Software Defined Networking Solution
There have been efforts towards improving network performance using software-defined networking solutions. One such work is Steroid OpenFlow Service (SOS), which utilizes multiple parallel TCP connections to enhance network performance transparently to the user. SOS has shown significant improvements in memory-to-memory data transfer throughput; however, its performance for disk-to-disk data transfer hasn't been studied. For computing applications involving big data, the data files are stored on non-volatile storage devices separate from the computing servers. Before computing can occur, large volumes of data must be fetched from the "remote" storage devices to the computing server's local storage device. Since hard drives are the most commonly adopted storage devices today, the process is often called "disk-to-disk" data transfer. For production high-performance computing facilities, specialized high-throughput data transfer software is provided for users to copy the data first to a data transfer node before copying it to the computing server. Disk-to-disk data transfer throughput depends on the network throughput between servers and the disk access performance between each server and its storage device. Due to large data sizes, the storage devices are typically parallel file systems spanning multiple disks. Disk operations in the disk-to-disk data transfer include disk read and write operations. The read operation in the transfer reads the data from the disks and stores it in memory. The second step in the transfer sends the data out to the network through the network interface. Data reaching the destination server is then stored to disk. The transfer faces multiple delays and is limited at each step. To date, one commonly adopted data transfer solution is GridFTP, developed by the Argonne National Laboratory. It requires custom application installations and configurations on the hosts.
SOS, on the other hand, is a transparent network application requiring no special user software. In this thesis, disk-to-disk data transfer performance is studied with both GridFTP and SOS. The thesis focuses on two topics: the first is a detailed analysis of the transfer components for each tool, and the second is a systematic experiment comparing the two. The experimentation and analysis of the results show that configuring the data nodes and network with correct parameters yields maximum performance for disk-to-disk data transfer. GridFTP, for example, is able to get close to 7 Gbps by using four parallel connections with a TCP buffer size of 16 MB. It achieves this maximum performance by filling the network pipe, which has a 10 Gbps end-to-end link with a round-trip time (RTT) of 53 ms.
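The numbers quoted above can be checked with a short bandwidth-delay-product calculation: a 10 Gbps path with 53 ms RTT needs about 66 MB in flight to stay full, and four connections with 16 MB buffers supply about 64 MB, so TCP windowing permits close to line rate; the observed ~7 Gbps is then limited elsewhere (e.g., by disk I/O), not by window size.

```python
# Worked bandwidth-delay product (BDP) check for the figures above.
link_bps = 10e9          # 10 Gbps end-to-end link
rtt_s = 0.053            # 53 ms round-trip time

bdp_bytes = link_bps * rtt_s / 8     # bytes in flight needed to fill pipe
bdp_mb = bdp_bytes / 1e6             # ~66.25 MB

total_buffer_mb = 4 * 16             # four connections x 16 MB TCP buffers
window_limited_fraction = min(1.0, total_buffer_mb / bdp_mb)
window_limited_gbps = window_limited_fraction * 10   # ~9.66 Gbps ceiling
```

Since the window-limited ceiling (~9.66 Gbps) sits well above the measured ~7 Gbps, the parallel-connection configuration has effectively removed TCP windowing as the bottleneck.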
Container Resource Allocation versus Performance of Data-intensive Applications on Different Cloud Servers
In recent years, data-intensive applications have been increasingly deployed
on cloud systems. Such applications utilize significant compute, memory, and
I/O resources to process large volumes of data. Optimizing the performance and
cost-efficiency for such applications is a non-trivial problem. The problem
becomes even more challenging with the increasing use of containers, which are
popular due to their lower operational overheads and faster boot speed at the
cost of weaker resource assurances for the hosted applications. In this paper,
two containerized data-intensive applications with very different performance
objectives and resource needs were studied on cloud servers with Docker
containers running on Intel Xeon E5 and AMD EPYC Rome multi-core processors
with a range of CPU, memory, and I/O configurations. Primary findings from our
experiments include: 1) Allocating multiple cores to a compute-intensive
application can improve performance, but only if the cores do not contend for
the same caches, and the optimal core counts depend on the specific workload;
2) allocating more memory to a memory-intensive application than its
deterministic data workload does not further improve performance; however, 3)
having multiple such memory-intensive containers on the same server can lead to
cache and memory bus contention leading to significant and volatile performance
degradation. The comparative observations on Intel and AMD servers provided insights into the trade-offs between a larger number of distributed chiplets interconnected with higher-speed buses (AMD) and a larger number of centrally integrated cores and caches with slower buses (Intel). For the two types of applications studied, the more distributed caches and faster data buses benefited the deployment of larger numbers of containers.
High Speed Networking In The Multi-Core Era
High speed networking is a demanding task that has traditionally been performed in dedicated, purpose-built hardware or specialized network processors. These platforms sacrifice flexibility or programmability in favor of performance. Recently, there has been much interest in using multi-core general-purpose processors for this task, which have the advantage of being easily programmable and upgradeable. The best way to exploit these new architectures for networking is an open question that has been the subject of much recent research. In this dissertation, I explore the best way to exploit multi-core general-purpose processors for packet processing applications. This includes both new architectural organizations for the processors as well as changes to the systems software. I intend to demonstrate the efficacy of these techniques by using them to build an open and extensible network security and monitoring platform that can outperform existing solutions.
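One common building block for the multi-core packet processing discussed above is hashing each flow's 5-tuple so that all packets of a flow land on the same core, as in receive-side scaling. The sketch below is a hedged illustration of that idea; the function and constant names are hypothetical, not from a specific platform.

```python
# Per-flow core assignment via 5-tuple hashing: packets of the same
# flow always map to the same core, keeping per-flow state local to
# one core and avoiding cross-core locking.
import zlib

NUM_CORES = 4

def core_for_packet(src_ip, dst_ip, src_port, dst_port, proto):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_CORES   # stable per-flow assignment

a = core_for_packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
b = core_for_packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
```

The trade-off, which motivates the architectural work in the dissertation, is that a skewed flow distribution can overload one core while others sit idle.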