Master of Science thesis
Efficient movement of massive amounts of data over high-speed networks at high throughput is essential for a modern-day in-memory storage system. In response to growing throughput and latency demands at scale, a new class of database systems was developed in recent years. The development of these systems was guided by increased access to high-throughput, low-latency network fabrics and the declining cost of Dynamic Random Access Memory (DRAM). These systems were designed with On-Line Transactional Processing (OLTP) workloads in mind and, as a result, are optimized for fast dispatch and perform well under small request-response scenarios. However, massive server responses, such as those for range queries and for data migration during load balancing, pose challenges for this design. This thesis analyzes the effects of large transfers on scale-out systems through the lens of a modern Network Interface Card (NIC). The present-day NIC offers new and exciting opportunities and challenges for large transfers, but using it efficiently requires smart data layout and concurrency control. We evaluated the impact of modern NICs on data layout design by measuring transmit performance, and the full system impact by observing the effects of Direct Memory Access (DMA), Remote Direct Memory Access (RDMA), and caching improvements such as Intel® Data Direct I/O (DDIO). We found that techniques such as zero copy yield around 25% savings in CPU cycles and a 50% reduction in memory bandwidth utilization on a server when combined with a client-assisted design in which records are not updated in place. We also ran experiments that highlight the bottlenecks in the current approach to data migration in RAMCloud, and we propose guidelines for a fast and efficient migration protocol for RAMCloud.
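As a rough illustration of the zero-copy idea mentioned in this abstract, the sketch below contrasts a copying response path with one that hands out references to records in place. It is only a Python analogy, not the thesis's NIC-level mechanism: the function names `gather_copy` and `gather_zero_copy` are hypothetical, and the records are assumed to be immutable, which mirrors the abstract's condition that records are not updated in place.

```python
def gather_copy(records):
    """Copying path: concatenate every record into one new response buffer."""
    return b"".join(records)


def gather_zero_copy(records):
    """Zero-copy path: return views that reference each record in place.

    This is only safe because the records are immutable (never updated
    in place), so a view cannot observe a partially written record.
    """
    return [memoryview(r) for r in records]


# Four hypothetical 1 KiB records held in an in-memory store.
records = [bytes([i]) * 1024 for i in range(4)]

copied = gather_copy(records)        # allocates and fills a new 4 KiB buffer
views = gather_zero_copy(records)    # allocates only four small view objects
```

On a real server a list of views like this could be passed to a scatter-gather send call, so the payload bytes are never staged in an intermediate buffer by the application.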
Implementation and comparison of iSCSI over RDMA
iSCSI is an emerging storage network technology that allows block-level access to disk drives over a computer network. Because iSCSI runs over the ubiquitous TCP/IP protocol, it has many advantages over its more proprietary alternatives. With the recent movement toward 10 gigabit Ethernet, storage vendors are interested in seeing how this large increase in network bandwidth could benefit the iSCSI protocol.
In order to make full use of the bandwidth provided by a 10 gigabit Ethernet link, specialized Remote Direct Memory Access hardware is being developed to offload processing and reduce the data-copy overhead found in a standard TCP/IP network stack. This thesis focuses on the development of an iSCSI implementation that is capable of supporting this new hardware and on the evaluation of its performance.
This thesis describes the approach used to implement the iSCSI Extensions for Remote Direct Memory Access (iSER) with the UNH iSCSI reference implementation. The approach is a three-step process: moving UNH-iSCSI from the Linux kernel to Linux user space, adding support for the iSER extensions to the user-space iSCSI, and finally moving everything back into the Linux kernel. In addition to a description of the implementation, results are given that demonstrate the performance of the completed iSER-assisted iSCSI implementation.
GEAR: A GPU-Centric Experience Replay System for Large Reinforcement Learning Models
This paper introduces a distributed, GPU-centric experience replay system, GEAR, designed to perform scalable reinforcement learning (RL) with large sequence models (such as transformers). With such models, existing systems such as Reverb face considerable bottlenecks in memory, computation, and communication. GEAR, however, optimizes memory efficiency by enabling the memory resources on GPU servers (including host memory and device memory) to manage trajectory data. Furthermore, it facilitates decentralized GPU devices to expedite various trajectory selection strategies, circumventing computational bottlenecks. GEAR is equipped with GPU kernels capable of collecting trajectories using zero-copy access to host memory, along with remote-directed-memory access over InfiniBand, improving communication efficiency. Cluster experiments have shown that GEAR can achieve performance levels up to 6× greater than Reverb when training state-of-the-art large RL models. GEAR is open-sourced at https://github.com/bigrl-team/gear
A complete and efficient CUDA-sharing solution for HPC clusters
In this paper we detail the key features, architectural design, and implementation of rCUDA,
an advanced framework to enable remote and transparent GPGPU acceleration in HPC
clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators,
which brings enhanced flexibility to cluster configurations. This opens the door to configurations
with fewer accelerators than nodes, as well as permits a single node to exploit the
whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly
interact with any GPU in the cluster, independently of its physical location. Thus,
GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU
servers, depending on the cluster administrator’s policy. This proposal leads to savings not
only in space but also in energy, acquisition, and maintenance costs. The performance evaluation
in this paper with a series of benchmarks and a production application clearly demonstrates
the viability of this proposal. Concretely, experiments with the matrix–matrix
product reveal excellent performance compared with regular executions on the local
GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up
to 11x speedup employing 8 remote accelerators from a single node with respect to a
12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration
in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization
frameworks are also evaluated.
2014 Elsevier B.V. All rights reserved.
This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-004-01. It was also supported by MINECO, FEDER funds, under Grant TIN2011-23283, and by the Fundacion Caixa-Castello Bancaixa, Grant P11B2013-21. This work was also supported in part by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. Authors are grateful for the generous support provided by Mellanox Technologies to the rCUDA Project. The authors would also like to thank Adrian Castello, member of The rCUDA Development Team, for his hard work on rCUDA.
Peña Monferrer, AJ.; Reaño González, C.; Silla Jiménez, F.; Mayo Gual, R.; Quintana-Orti, ES.; Duato Marín, JF. (2014). A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing. 40(10):574-588. https://doi.org/10.1016/j.parco.2014.09.011
Scalable RDMA performance in PGAS languages
Partitioned global address space (PGAS) languages provide a unique programming model that can span shared-memory multiprocessor (SMP) architectures, distributed memory machines, or clusters of SMPs. Users can program large scale machines with easy-to-use, shared memory paradigms. In order to exploit large scale machines efficiently, PGAS language implementations and their runtime systems must be designed for scalability and performance. The IBM XLUPC compiler and runtime system provide a scalable design through the use of the shared variable directory (SVD). The SVD stores meta-information needed to access shared data. It is dereferenced, in the worst case, for every shared memory access, thus exposing a potential performance problem. In this paper we present a cache of remote addresses as an optimization that reduces the SVD access overhead and allows the exploitation of native (remote) direct memory accesses. It results in a significant performance improvement while maintaining run-time portability and scalability.
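The remote address cache described in this abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class names, the `(handle, node)` key, and the flat base-plus-offset addressing are hypothetical and do not reflect the actual XLUPC runtime interfaces.

```python
class SharedVariableDirectory:
    """Hypothetical stand-in for the SVD: maps (handle, node) to a base address."""

    def __init__(self, table):
        self._table = dict(table)
        self.lookups = 0  # counts how often the directory is dereferenced

    def resolve(self, handle, node):
        self.lookups += 1
        return self._table[(handle, node)]


class RemoteAddressCache:
    """Caches resolved base addresses so repeated accesses skip the directory."""

    def __init__(self, svd):
        self._svd = svd
        self._cache = {}

    def address_of(self, handle, node, offset):
        key = (handle, node)
        base = self._cache.get(key)
        if base is None:
            # Cache miss: pay for one directory lookup, then remember the result.
            base = self._svd.resolve(handle, node)
            self._cache[key] = base
        return base + offset


# Hypothetical shared array "A" with pieces on nodes 1 and 2.
svd = SharedVariableDirectory({("A", 1): 0x1000, ("A", 2): 0x8000})
cache = RemoteAddressCache(svd)
addrs = [cache.address_of("A", 1, off) for off in (0, 8, 16)]
```

With the cache in place, three accesses to the same remote piece trigger a single directory lookup, and the resulting raw address is the kind of value a direct remote memory access would need.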
HDArray: Parallel Array Interface for Distributed Heterogeneous Devices
Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. MPI provides a mechanism for managing communication between address spaces, and OpenCL provides a way to manage computation and communication within a process that has access to heterogeneous computational resources, but programmers are forced to write hybrid programs that manage the interaction of both of these systems. This paper describes an array programming interface that provides users with automatic or manual distributions of data and work. Using the distribution and information about what data is used and defined by kernels, communication among processes and among devices in a process is performed automatically. The interface provides a unified programming model to the user, thus simplifying program development.
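To make the idea of deriving communication from a distribution plus use and definition information concrete, here is a minimal sketch. It is not the HDArray interface: the function names, the one-dimensional block distribution, and the stencil-style use set are all assumptions chosen for illustration.

```python
def block_size(n, procs):
    """Elements per process under a block distribution (ceiling division)."""
    return -(-n // procs)


def block_range(rank, n, procs):
    """Indices that process `rank` owns, i.e. the elements it defines."""
    size = block_size(n, procs)
    lo = min(rank * size, n)
    return range(lo, min(lo + size, n))


def block_owner(index, n, procs):
    """Process that owns element `index`."""
    return index // block_size(n, procs)


def elements_to_fetch(rank, n, procs, radius):
    """Elements a process uses but does not own, grouped by owning process.

    The use set is assumed to be the owned block widened by `radius`
    on each side, as in a one-dimensional stencil kernel.
    """
    owned = block_range(rank, n, procs)
    if len(owned) == 0:
        return {}
    used = range(max(owned.start - radius, 0), min(owned.stop + radius, n))
    fetch = {}
    for i in used:
        if i not in owned:
            fetch.setdefault(block_owner(i, n, procs), []).append(i)
    return fetch


# Eight elements over four processes, kernel reads one neighbor on each side.
plan_rank1 = elements_to_fetch(rank=1, n=8, procs=4, radius=1)
plan_rank0 = elements_to_fetch(rank=0, n=8, procs=4, radius=1)
```

The resulting plan lists, per owning process, exactly which elements must be transferred before the kernel runs, which is the information a runtime would need to issue the communication on the user's behalf.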