178 research outputs found

    Master of Science

    Efficient movement of massive amounts of data over high-speed networks at high throughput is essential for a modern-day in-memory storage system. In response to growing throughput and latency demands at scale, a new class of database systems was developed in recent years. The development of these systems was guided by increased access to high-throughput, low-latency network fabrics and the declining cost of Dynamic Random Access Memory (DRAM). These systems were designed with On-Line Transactional Processing (OLTP) workloads in mind and, as a result, are optimized for fast dispatch and perform well under small request-response scenarios. However, massive server responses, such as those for range queries and for data migration during load balancing, pose challenges for this design. This thesis analyzes the effects of large transfers on scale-out systems through the lens of a modern Network Interface Card (NIC). The present-day NIC offers new and exciting opportunities and challenges for large transfers, but using it efficiently requires smart data layout and concurrency control. We evaluated the impact of modern NICs on data layout design by measuring transmit performance, and the full system impact by observing the effects of Direct Memory Access (DMA), Remote Direct Memory Access (RDMA), and caching improvements such as Intel® Data Direct I/O (DDIO). We found that techniques such as Zero Copy yield around 25% savings in CPU cycles and a 50% reduction in memory bandwidth utilization on a server when combined with a client-assisted design in which records are not updated in place. We also set up experiments that underlined the bottlenecks in the current approach to data migration in RAMCloud, and we propose guidelines for a fast and efficient migration protocol for RAMCloud.

    Implementation and comparison of iSCSI over RDMA

    iSCSI is an emerging storage network technology that allows block-level access to disk drives over a computer network. Because iSCSI runs over the ubiquitous TCP/IP protocol, it has many advantages over its more proprietary alternatives. Given the recent movement toward 10 gigabit Ethernet, storage vendors are interested in seeing how this large increase in network bandwidth could benefit the iSCSI protocol. To make full use of the bandwidth provided by a 10 gigabit Ethernet link, specialized Remote Direct Memory Access hardware is being developed to offload processing and reduce the data-copy overhead found in a standard TCP/IP network stack. This thesis focuses on the development of an iSCSI implementation capable of supporting this new hardware and on the evaluation of its performance. It describes the approach used to implement the iSCSI Extensions for Remote Direct Memory Access (iSER) with the UNH iSCSI reference implementation. This approach involves a three-step process: moving UNH-iSCSI from the Linux kernel to Linux user space, adding support for the iSER extensions to our user-space iSCSI, and finally moving everything back into the Linux kernel. In addition to a description of the implementation, results are given that demonstrate the performance of the completed iSER-assisted iSCSI implementation.

    Characterizing Computation-Communication Overlap in Message-Passing Systems


    GEAR: A GPU-Centric Experience Replay System for Large Reinforcement Learning Models

    This paper introduces a distributed, GPU-centric experience replay system, GEAR, designed to perform scalable reinforcement learning (RL) with large sequence models (such as transformers). With such models, existing systems such as Reverb face considerable bottlenecks in memory, computation, and communication. GEAR, however, optimizes memory efficiency by enabling the memory resources on GPU servers (including host memory and device memory) to manage trajectory data. Furthermore, it facilitates decentralized GPU devices to expedite various trajectory selection strategies, circumventing computational bottlenecks. GEAR is equipped with GPU kernels capable of collecting trajectories using zero-copy access to host memory, along with remote direct memory access over InfiniBand, improving communication efficiency. Cluster experiments have shown that GEAR can achieve performance levels up to 6× greater than Reverb when training state-of-the-art large RL models. GEAR is open-sourced at https://github.com/bigrl-team/gear

    Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP


    A complete and efficient CUDA-sharing solution for HPC clusters

    In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows decoupling GPUs from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, and permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator’s policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper with a series of benchmarks and a production application clearly demonstrates the viability of this proposal. Concretely, experiments with the matrix–matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to 11x speedup employing 8 remote accelerators from a single node with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and data transfer performance of similar GPU virtualization frameworks are also evaluated. 2014 Elsevier B.V. All rights reserved.
    This work was supported by the Spanish Ministerio de Economia y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-004-01. It was also supported by MINECO, FEDER funds, under Grant TIN2011-23283, and by the Fundacion Caixa-Castello Bancaixa, Grant P11B2013-21. This work was also supported in part by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. Authors are grateful for the generous support provided by Mellanox Technologies to the rCUDA Project. The authors would also like to thank Adrian Castello, member of The rCUDA Development Team, for his hard work on rCUDA.
    Peña Monferrer, AJ.; Reaño González, C.; Silla Jiménez, F.; Mayo Gual, R.; Quintana-Orti, ES.; Duato Marín, JF. (2014). A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing. 40(10):574-588. https://doi.org/10.1016/j.parco.2014.09.011

    Scalable RDMA performance in PGAS languages

    Partitioned global address space (PGAS) languages provide a unique programming model that can span shared-memory multiprocessor (SMP) architectures, distributed memory machines, or clusters of SMPs. Users can program large-scale machines with easy-to-use, shared memory paradigms. In order to exploit large-scale machines efficiently, PGAS language implementations and their runtime systems must be designed for scalability and performance. The IBM XLUPC compiler and runtime system provide a scalable design through the use of the shared variable directory (SVD). The SVD stores meta-information needed to access shared data. It is dereferenced, in the worst case, for every shared memory access, thus exposing a potential performance problem. In this paper we present a cache of remote addresses as an optimization that reduces the SVD access overhead and allows the exploitation of native (remote) direct memory accesses. It results in a significant performance improvement while maintaining run-time portability and scalability.

    HDArray: Parallel Array Interface for Distributed Heterogeneous Devices

    Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides a mechanism for managing inter-address-space communication, and OpenCL provides a way to manage computation and communication within a process with access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the interaction of both of these systems. This paper describes an array programming interface that provides users with automatic or manual distributions of data and work. Using the distribution and information about what data is used and defined by kernels, communication among processes and among devices in a process is performed automatically. The interface provides a unified programming model to the user, thus simplifying program development.