178 research outputs found

Master of Science

Author: Kesavan Aniraj
Publication venue: University of Utah
Publication date: 01/01/2017

Efficient movement of massive amounts of data over high-speed networks at high throughput is essential for a modern-day in-memory storage system. In response to growing throughput and latency demands at scale, a new class of database systems was developed in recent years. The development of these systems was guided by increased access to high-throughput, low-latency network fabrics and the declining cost of Dynamic Random Access Memory (DRAM). These systems were designed with On-Line Transactional Processing (OLTP) workloads in mind and, as a result, are optimized for fast dispatch and perform well under small request-response scenarios. However, massive server responses, such as those for range queries and data migration for load balancing, pose challenges for this design. This thesis analyzes the effects of large transfers on scale-out systems through the lens of a modern Network Interface Card (NIC). The present-day NIC offers new and exciting opportunities and challenges for large transfers, but using it efficiently requires smart data layout and concurrency control. We evaluated the impact of modern NICs on data layout design by measuring transmit performance, and the full system impact by observing the effects of Direct Memory Access (DMA), Remote Direct Memory Access (RDMA), and caching improvements such as Intel® Data Direct I/O (DDIO). We discovered that techniques such as Zero Copy yield around 25% savings in CPU cycles and a 50% reduction in memory bandwidth utilization on a server when using a client-assisted design with records that are not updated in place. We also set up experiments that underline the bottlenecks in the current approach to data migration in RAMCloud and propose guidelines for a fast and efficient migration protocol for RAMCloud.

The University of Utah: J. Willard Marriott Digital Library

Implementation and comparison of iSCSI over RDMA

Author: Burns Ethan
Publication venue: University of New Hampshire Scholars' Repository
Publication date: 01/01/2008

iSCSI is an emerging storage network technology that allows for block-level access to disk drives over a computer network. Since iSCSI runs over the ubiquitous TCP/IP protocol, it has many advantages over its more proprietary alternatives. Due to the recent movement toward 10 gigabit Ethernet, storage vendors are interested to see how this large increase in network bandwidth could benefit the iSCSI protocol. In order to make full use of the bandwidth provided by a 10 gigabit Ethernet link, specialized Remote Direct Memory Access hardware is being developed to offload processing and reduce the data-copy overhead found in a standard TCP/IP network stack. This thesis focuses on the development of an iSCSI implementation that is capable of supporting this new hardware and the evaluation of its performance. This thesis describes the approach used to implement the iSCSI Extensions for Remote Direct Memory Access (iSER) with the UNH iSCSI reference implementation. This approach involves a three-step process: moving UNH-iSCSI from the Linux kernel to the Linux user-space, adding support for the iSER extensions to our user-space iSCSI, and finally moving everything back into the Linux kernel. In addition to a description of the implementation, results are given that demonstrate the performance of the completed iSER-assisted iSCSI implementation.

UNH Scholars' Repository

Characterizing Computation-Communication Overlap in Message-Passing Systems

Author
Publication venue: Office of Scientific and Technical Information (OSTI)
Publication date

Crossref

GEAR: A GPU-Centric Experience Replay System for Large Reinforcement Learning Models

Author: He C
Mai L
Sit MK
Wang H
Wang J
Wen Y
Yang Y
Zhang W
Publication venue: ML Research Press
Publication date: 29/07/2023

This paper introduces a distributed, GPU-centric experience replay system, GEAR, designed to perform scalable reinforcement learning (RL) with large sequence models (such as transformers). With such models, existing systems such as Reverb face considerable bottlenecks in memory, computation, and communication. GEAR, however, optimizes memory efficiency by enabling the memory resources on GPU servers (including host memory and device memory) to manage trajectory data. Furthermore, it enables decentralized GPU devices to expedite various trajectory selection strategies, circumventing computational bottlenecks. GEAR is equipped with GPU kernels capable of collecting trajectories using zero-copy access to host memory, along with remote direct memory access (RDMA) over InfiniBand, improving communication efficiency. Cluster experiments have shown that GEAR can achieve performance levels up to 6× greater than Reverb when training state-of-the-art large RL models. GEAR is open-sourced at https://github.com/bigrl-team/gear

UCL Discovery

Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP

Author
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2007

Crossref

A complete and efficient CUDA-sharing solution for HPC clusters

Author: Antonio J. Peña
Carlos Reaño
Federico Silla
Rafael Mayo
Enrique S. Quintana-Ortí
José Duato
Publication venue: Elsevier BV
Publication date: 01/01/2014

In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, as well as permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator’s policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix–matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated. 2014 Elsevier B.V. All rights reserved.

This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-004-01. It was also supported by MINECO, FEDER funds, under Grant TIN2011-23283, and by the Fundacion Caixa-Castello Bancaixa, Grant P11B2013-21. This work was also supported in part by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. Authors are grateful for the generous support provided by Mellanox Technologies to the rCUDA Project. The authors would also like to thank Adrian Castello, member of The rCUDA Development Team, for his hard work on rCUDA.

Peña Monferrer, AJ.; Reaño González, C.; Silla Jiménez, F.; Mayo Gual, R.; Quintana-Orti, ES.; Duato Marín, JF. (2014). A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing. 40(10):574-588. https://doi.org/10.1016/j.parco.2014.09.011

Queen's University Belfast Research Portal

Crossref

Repositori Institucional de la Universitat Jaume I

RiuNet

Scalable RDMA performance in PGAS languages

Author: Almási George
Cortés Toni
Farreras Esclusa Montserrat
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/05/2009

Partitioned global address space (PGAS) languages provide a unique programming model that can span shared-memory multiprocessor (SMP) architectures, distributed memory machines, or clusters of SMPs. Users can program large-scale machines with easy-to-use, shared memory paradigms. In order to exploit large-scale machines efficiently, PGAS language implementations and their runtime systems must be designed for scalability and performance. The IBM XLUPC compiler and runtime system provide a scalable design through the use of the shared variable directory (SVD). The SVD stores meta-information needed to access shared data. It is dereferenced, in the worst case, for every shared memory access, thus exposing a potential performance problem. In this paper we present a cache of remote addresses as an optimization that will reduce the SVD access overhead and allow the exploitation of native (remote) direct memory accesses. It results in a significant performance improvement while maintaining run-time portability and scalability.

UPCommons. Portal del coneixement obert de la UPC

HDArray: Parallel Array Interface for Distributed Heterogeneous Devices

Author: A Klöckner
J Lee
J Towns
M Viñas
P Sakdhnagool
S Ernsting
W Gropp
Publication venue: Purdue University (bepress)
Publication date: 26/03/2018

Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides a mechanism for, and management of, inter-address-space communication, and OpenCL provides a way to manage computation and communication within a process with access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the interaction of both of these systems. This paper describes an array programming interface that provides users with automatic or manual distributions of data and work. Using the distribution and information about what data is used and defined by kernels, communication among processes and among devices in a process is performed automatically. The interface provides a unified programming model to the user, thus simplifying program development.

arXiv.org e-Print Archive

Crossref

Purdue E-Pubs