Network acceleration techniques
Splintered offloading techniques with receive batch processing are described for network acceleration. Such techniques offload specific functionality to a NIC while keeping the bulk of the protocol processing in the host operating system ("OS"). The resulting protocol implementation allows the application to bypass the protocol processing of the received data. This is accomplished by moving data from the NIC directly to the application through direct memory access ("DMA") and batch-processing the receive headers in the host OS when the host OS is interrupted to perform other work. Batch processing the receive headers allows the data path to be separated from the control path. Unlike operating-system bypass, however, the operating system still fully manages the network resource and retains relevant feedback about traffic and flows. Embodiments of the present disclosure can therefore address the challenges of networks with extreme bandwidth delay products (BWDP).
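The data-path/control-path split described above can be sketched as a toy simulation. This is an illustrative model only, not a real NIC driver API; all class and attribute names are hypothetical.

```python
# Hedged sketch of splintered offloading: payloads go straight to the
# application ("DMA" data path), while headers are queued and
# batch-processed by the host OS on its next interrupt (control path).
from collections import deque

class SplinteredNIC:
    def __init__(self):
        self.header_queue = deque()   # control path: deferred headers
        self.app_buffer = []          # data path: direct "DMA" target

    def receive(self, header, payload):
        self.app_buffer.append(payload)   # data bypasses OS protocol stack
        self.header_queue.append(header)  # header deferred for batching

class HostOS:
    def __init__(self):
        self.flow_stats = {}  # OS keeps full visibility of traffic

    def on_interrupt(self, nic):
        # Batch-process all pending headers in one pass, amortizing
        # per-packet protocol costs over the whole batch.
        batch = list(nic.header_queue)
        nic.header_queue.clear()
        for hdr in batch:
            flow = hdr["flow_id"]
            self.flow_stats[flow] = self.flow_stats.get(flow, 0) + hdr["len"]
        return len(batch)

nic, host = SplinteredNIC(), HostOS()
for _ in range(4):
    nic.receive({"flow_id": "f1", "len": 1500}, b"x" * 1500)
processed = host.on_interrupt(nic)
```

Because the OS still sees every header, it retains the traffic and flow feedback that full kernel bypass would sacrifice.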
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
Graph Neural Networks (GNNs) are emerging as a powerful tool for learning
from graph-structured data and performing sophisticated inference tasks in
various application domains. Although GNNs have been shown to be effective on
modest-sized graphs, training them on large-scale graphs remains a significant
challenge due to lack of efficient data access and data movement methods.
Existing frameworks for training GNNs use CPUs for graph sampling and feature
aggregation, while the training and updating of model weights are executed on
GPUs. However, our in-depth profiling shows that the CPUs cannot achieve the
throughput required to saturate GNN model training, causing gross
under-utilization of expensive GPU resources. Furthermore, when the graph and
its embeddings do not fit in CPU memory, the overhead introduced by the
operating system, e.g., for handling page faults, falls on the critical path
of execution.
To address these issues, we propose the GPU Initiated Direct Storage Access
(GIDS) dataloader, to enable GPU-oriented GNN training for large-scale graphs
while efficiently utilizing all hardware resources, such as CPU memory,
storage, and GPU memory with a hybrid data placement strategy. By enabling GPU
threads to fetch feature vectors directly from storage, GIDS dataloader solves
the memory capacity problem for GPU-oriented GNN training. Moreover, GIDS
dataloader leverages GPU parallelism to tolerate storage latency and eliminates
expensive page-fault overhead. Doing so enables us to design novel
optimizations for exploiting locality and increasing effective bandwidth for
GNN training. Our evaluation using a single GPU on terabyte-scale GNN datasets
shows that GIDS dataloader accelerates the overall DGL GNN training pipeline by
up to 392X when compared to the current state-of-the-art DGL dataloader.
Comment: Under Submission. Source code:
https://github.com/jeongminpark417/GID
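The hybrid data placement idea in the abstract above can be illustrated with a toy dataloader: hot feature vectors are cached in host memory, and misses fall through to (simulated) storage. This is a minimal sketch under stated assumptions; the class and method names are hypothetical and not the actual GIDS API.

```python
# Illustrative sketch of hybrid placement for GNN feature fetching:
# hot node features live in a host-memory cache, cold features are
# read from storage (here a plain dict standing in for an SSD).
class HybridDataloader:
    def __init__(self, storage, hot_ids):
        self.storage = storage                          # simulated SSD
        self.cache = {i: storage[i] for i in hot_ids}   # host-memory cache
        self.storage_reads = 0

    def fetch(self, node_ids):
        feats = []
        for i in node_ids:
            if i in self.cache:           # locality hit: no storage access
                feats.append(self.cache[i])
            else:                         # miss: direct storage read
                self.storage_reads += 1
                feats.append(self.storage[i])
        return feats

storage = {i: [float(i)] * 4 for i in range(100)}
dl = HybridDataloader(storage, hot_ids=range(10))
batch = dl.fetch([1, 2, 50, 51])
```

In the real system the misses would be issued by many GPU threads in parallel, so storage latency is overlapped rather than serialized; the toy model only captures the placement decision.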
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
We implemented the NaNet FPGA-based PCIe Gen2 GbE/APElink NIC, featuring
GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is
able to receive a UDP input data stream from its GbE interface and redirect it,
without any intermediate buffering or CPU intervention, to the memory of a
Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices
share the same upstream root complex. Synthetic benchmarks for latency and
bandwidth are presented. We describe how NaNet can be employed in the prototype
of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment,
to implement the data link between the TEL62 readout boards and the low level
trigger processor. Results for the throughput and latency of the integrated
system are presented and discussed.
Comment: Proceedings for the 20th International Conference on Computing in
High Energy and Nuclear Physics (CHEP
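A rough host-side analogy for NaNet's copy-avoiding receive path (not the hardware mechanism itself): receiving a UDP datagram directly into a preallocated buffer with `recv_into()`, much as NaNet steers datagrams into GPU memory without staging them in intermediate host buffers. The sketch uses loopback sockets purely for illustration.

```python
# Receive a UDP datagram directly into a preallocated buffer,
# avoiding an intermediate per-packet allocation/copy at this layer.
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))                 # ephemeral loopback port
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"event-data", rx.getsockname())

buf = bytearray(2048)      # preallocated "device" buffer
n = rx.recv_into(buf)      # payload lands directly in buf
payload = bytes(buf[:n])

rx.close()
tx.close()
```

The actual NIC goes further: GPUDirect RDMA lets the FPGA write into GPU memory with no CPU involvement at all, which no socket API can reproduce.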
Network Vortex: Distributed Virtual Memory for Streaming Applications
Explosive growth of the Internet, cluster computing, and storage technology has led to the generation of enormous volumes of information and the need for scalable data computing. One of the central frameworks for such analysis is MapReduce, a programming platform for processing streaming data in external/distributed memory. Despite a significant public effort, open-source implementations of MapReduce (e.g., Hadoop, Spark) are complicated, bulky, and inefficient. To overcome this problem, we explore employing and expanding upon a recent C/C++ programming abstraction called Vortex that offers a simple interface to the user, zero-copy operation, low RAM consumption, and high data throughput. In particular, this research examines algorithms and techniques for enabling Vortex operation over the network, including both TCP/IP sockets and data-link RDMA (e.g., InfiniBand) interfaces. We developed a new producer-consumer memory-stream abstraction presented as a Vortex stream split across two hosts. The stream travels through a hidden network communication layer that provides the illusion of writing a continuous stream of data directly into a window of memory on a remote machine. This enables high-performance networking code and size-agnostic data transport, written as simply as an in-memory copy operation, overcoming the complications normally inherent in the discrete nature of network packet transfer. While the resulting product is highly workable over standard IP-based networks, the design limitations of RDMA technology in interfacing with virtual memory make Vortex streams a suboptimal abstraction for this programming platform, as its central appeal of zero-copy network transfer is rendered largely inaccessible. Alternative algorithms to enhance RDMA performance with Vortex are proposed for future study.
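The stream-split idea above can be modeled in miniature: the producer writes through a file-like object that looks like appending to memory but is backed by a socket, and the consumer on the other side reads a continuous byte stream. This is a hedged sketch, not the Vortex API; `NetStream` and its methods are illustrative names.

```python
# Minimal producer-consumer "memory stream" split across two endpoints
# (a socketpair stands in for the two hosts). Writes look like copies
# into memory; the network layer is hidden behind the write() call.
import socket

class NetStream:
    """Producer side: file-like writes, actually a hidden socket."""
    def __init__(self, sock):
        self.sock = sock

    def write(self, data):
        self.sock.sendall(data)   # hidden communication layer

    def close(self):
        self.sock.shutdown(socket.SHUT_WR)  # signal end-of-stream

producer_sock, consumer_sock = socket.socketpair()
stream = NetStream(producer_sock)
for chunk in (b"map-", b"reduce-", b"records"):
    stream.write(chunk)           # size-agnostic, copy-like interface
stream.close()

received = b""                    # consumer reassembles the byte stream
while True:
    part = consumer_sock.recv(4096)
    if not part:
        break
    received += part
```

The consumer sees one continuous stream regardless of how the producer's writes were chunked, which is exactly the packet-boundary abstraction the thesis describes.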
Disk-to-Disk Data Transfer using A Software Defined Networking Solution
There have been efforts towards improving network performance using software-defined networking solutions. One such work is Steroid OpenFlow Service (SOS), which utilizes multiple parallel TCP connections to enhance network performance transparently to the user. SOS has shown significant improvements in memory-to-memory data transfer throughput; however, its performance for disk-to-disk data transfer hasn't been studied. For computing applications involving big data, the data files are stored on non-volatile storage devices separate from the computing servers. Before computing can occur, large volumes of data must be fetched from the "remote" storage devices to the computing server's local storage device. Since hard drives are the most commonly adopted storage devices today, the process is often called "disk-to-disk" data transfer. For production high-performance computing facilities, specialized high-throughput data transfer software is provided for users to copy the data first to a data transfer node before copying it to the computing server. Disk-to-disk data transfer throughput depends on the network throughput between servers and the disk access performance between each server and its storage device. Due to large data sizes, the storage devices are typically parallel file systems spanning multiple disks. Disk operations in the disk-to-disk data transfer include disk read and write operations. The read operation in the transfer reads the data from the disks and stores it in memory. The second step in the transfer sends the data out to the network through the network interface. Data reaching the destination server is then stored to disk. The transfer faces multiple delays and is limited at each step. To date, one commonly adopted data transfer solution is GridFTP, developed by the Argonne National Laboratory. It requires custom application installations and configurations on the hosts.
SOS, on the other hand, is a transparent network application requiring no special user software. In this thesis, disk-to-disk data transfer performance is studied with both GridFTP and SOS. The thesis focuses on two topics: the first is a detailed analysis of the transfer components for each tool, and the second is a systematic experiment comparing the two. The experimentation and analysis of the results show that configuring the data nodes and network with correct parameters yields maximum performance for disk-to-disk data transfer. GridFTP, for example, is able to get close to 7 Gbps by using four parallel connections with a TCP buffer size of 16 MB. It achieves this maximum performance by filling the network pipe, which has a 10 Gbps end-to-end link with a round-trip time (RTT) of 53 ms.
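The numbers quoted above can be checked with a short bandwidth-delay-product calculation: a 10 Gbps path with 53 ms RTT needs about 66 MB in flight to stay full, and four connections with 16 MB buffers supply about 64 MB, so TCP windowing permits close to line rate; the observed ~7 Gbps is then limited elsewhere (e.g., by disk I/O), not by window size.

```python
# Worked bandwidth-delay product (BDP) check for the figures above.
link_bps = 10e9          # 10 Gbps end-to-end link
rtt_s = 0.053            # 53 ms round-trip time

bdp_bytes = link_bps * rtt_s / 8     # bytes in flight needed to fill pipe
bdp_mb = bdp_bytes / 1e6             # ~66.25 MB

total_buffer_mb = 4 * 16             # four connections x 16 MB TCP buffers
window_limited_fraction = min(1.0, total_buffer_mb / bdp_mb)
window_limited_gbps = window_limited_fraction * 10   # ~9.66 Gbps ceiling
```

Since the window-limited ceiling (~9.66 Gbps) sits well above the measured ~7 Gbps, the parallel-connection configuration has effectively removed TCP windowing as the bottleneck.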
Container Resource Allocation versus Performance of Data-intensive Applications on Different Cloud Servers
In recent years, data-intensive applications have been increasingly deployed
on cloud systems. Such applications utilize significant compute, memory, and
I/O resources to process large volumes of data. Optimizing the performance and
cost-efficiency for such applications is a non-trivial problem. The problem
becomes even more challenging with the increasing use of containers, which are
popular due to their lower operational overheads and faster boot speed at the
cost of weaker resource assurances for the hosted applications. In this paper,
two containerized data-intensive applications with very different performance
objectives and resource needs were studied on cloud servers with Docker
containers running on Intel Xeon E5 and AMD EPYC Rome multi-core processors
with a range of CPU, memory, and I/O configurations. Primary findings from our
experiments include: 1) Allocating multiple cores to a compute-intensive
application can improve performance, but only if the cores do not contend for
the same caches, and the optimal core counts depend on the specific workload;
2) allocating more memory to a memory-intensive application than its
deterministic data workload does not further improve performance; however, 3)
having multiple such memory-intensive containers on the same server can lead to
cache and memory bus contention leading to significant and volatile performance
degradation. The comparative observations on Intel and AMD servers provided insights into the trade-offs between a larger number of distributed chiplets interconnected with higher-speed buses (AMD) and a larger number of centrally integrated cores and caches with slower buses (Intel). For the two types of applications studied, the more distributed caches and faster data buses benefited the deployment of larger numbers of containers.
High Speed Networking In The Multi-Core Era
High speed networking is a demanding task that has traditionally been performed in dedicated, purpose-built hardware or specialized network processors. These platforms sacrifice flexibility or programmability in favor of performance. Recently, there has been much interest in using multi-core general-purpose processors for this task, which have the advantage of being easily programmable and upgradeable. The best way to exploit these new architectures for networking is an open question that has been the subject of much recent research. In this dissertation, I explore the best way to exploit multi-core general-purpose processors for packet processing applications. This includes both new architectural organizations for the processors as well as changes to the systems software. I intend to demonstrate the efficacy of these techniques by using them to build an open and extensible network security and monitoring platform that can outperform existing solutions.
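One common building block for the multi-core packet processing discussed above is hashing each flow's 5-tuple so that all packets of a flow land on the same core, as in receive-side scaling. The sketch below is a hedged illustration of that idea; the function and constant names are hypothetical, not from a specific platform.

```python
# Per-flow core assignment via 5-tuple hashing: packets of the same
# flow always map to the same core, keeping per-flow state local to
# one core and avoiding cross-core locking.
import zlib

NUM_CORES = 4

def core_for_packet(src_ip, dst_ip, src_port, dst_port, proto):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % NUM_CORES   # stable per-flow assignment

a = core_for_packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
b = core_for_packet("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
```

The trade-off, which motivates the architectural work in the dissertation, is that a skewed flow distribution can overload one core while others sit idle.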