316 research outputs found

    The Impact of RDMA on Agreement

    Full text link
    Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires n≥2fP+1n \geq 2f_P + 1 processes (where fPf_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n≥fP+1n \geq f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.Comment: Full version of PODC'19 paper, strengthened broadcast algorith

    Towards Scalable OLTP Over Fast Networks

    Get PDF
    Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce. These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing. High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine. Instead, distributed OLTP database management systems (DBMS) are deployed across multiple powerful machines. Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s. However, fast networks challenge the conventional belief that network communication is the main bottleneck. In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network. RDMA's primitives allow direct access to the memory of a remote machine within an order of magnitude of local memory access. This development invalidates the notion that network communication is the primary bottleneck. Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems. This thesis focuses on the challenges RDMA presents and its implications on the design of distributed OLTP systems. First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems. Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks. The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching). With the introduction of RDMA, the landscape of data access has undergone a significant transformation. This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies. Thus, as the second contribution, we design RDMA-optimized tree-based indexes — especially applicable for disaggregated databases to access remote data efficiently. We then turn our attention to the unique challenges of RDMA. One-sided RDMA, one of the network primitives introduced by RDMA, presents a performance advantage in enabling remote memory access while bypassing the remote CPU and the operating system. This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, that way, specialized one-sided RDMA synchronization primitives are required since traditional CPU-driven primitives are bypassed. We found that existing RDMA one-sided synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption. As our third contribution, we address this issue by offering guidelines to build scalable and correct one-sided RDMA synchronization primitives. Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage. By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes. Central to our approach is a distributed caching protocol that dynamically caches data. With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently

    Master of Science

    Get PDF
    thesisEfficient movement of massive amounts of data over high-speed networks at high throughput is essential for a modern-day in-memory storage system. In response to the growing needs of throughput and latency demands at scale, a new class of database systems was developed in recent years. The development of these systems was guided by increased access to high throughput, low latency network fabrics, and declining cost of Dynamic Random Access Memory (DRAM). These systems were designed with On-Line Transactional Processing (OLTP) workloads in mind, and, as a result, are optimized for fast dispatch and perform well under small request-response scenarios. However, massive server responses such as those for range queries and data migration for load balancing poses challenges for this design. This thesis analyzes the effects of large transfers on scale-out systems through the lens of a modern Network Interface Card (NIC). The present-day NIC offers new and exciting opportunities and challenges for large transfers, but using them efficiently requires smart data layout and concurrency control. We evaluated the impact of modern NICs in designing data layout by measuring transmit performance and full system impact by observing the effects of Direct Memory Access (DMA), Remote Direct Memory Access (RDMA), and caching improvements such as Intel® Data Direct I/O (DDIO). We discovered that use of techniques such as Zero Copy yield around 25% savings in CPU cycles and a 50% reduction in the memory bandwidth utilization on a server by using a client-assisted design with records that are not updated in place. We also set up experiments that underlined the bottlenecks in the current approach to data migration in RAMCloud and propose guidelines for a fast and efficient migration protocol for RAMCloud

    Building a Framework for High-performance In-memory Message-Oriented Middleware

    Get PDF
    Message-Oriented Middleware (MOM) is a popular class of software used in many distributed applications, ranging from business systems and social networks to gaming and streaming media services. As workloads continue to grow both in terms of the number of users and the amount of content, modern MOM systems face increasing demands in terms of performance and scalability. Recent advances in networking such as Remote Direct Memory Access (RDMA) offer a more efficient data transfer mechanism compared to traditional kernel-level socket networking used by existing widely-used MOM systems. Unfortunately, RDMA’s complex interface has made it difficult for MOM systems to utilize its capabilities. In this thesis, we introduce a framework called RocketBufs, which provides abstractions and interfaces for constructing high-performance MOM systems. Applications implemented using RocketBufs produce and consume data using regions of memory called buffers while the framework is responsible for transmitting, receiving and synchronizing buffer access. RocketBufs’ buffer abstraction is designed to work efficiently with different transport protocols, allowing messages to be distributed using RDMA or TCP using the same APIs (i.e., by simply changing a configuration file). We demonstrate the utility and evaluate the performance of RocketBufs by using it to implement a publish/subscribe system called RBMQ. We compare it against two widely-used, industry-grade MOM systems, namely RabbitMQ and Redis. Our evaluations show that when using TCP, RBMQ achieves up to 1.9 times higher messaging throughput than RabbitMQ, a message queuing system with an equivalent flow control scheme. When RDMA is used, RBMQ shows significant gains in messaging throughput (up to 3.7 times higher than RabbitMQ and up to 1.7 times higher than Redis), as well as reductions in median delivery latency (up to 81% lower than RabbitMQ and 47% lower than Redis). In addition, on RBMQ subscriber hosts configured to use RDMA, data transfers occur with negligible CPU overhead regardless of the amount of data being transferred. This allows CPU resources to be used for other purposes like processing data. To further demonstrate the flexibility of RocketBufs, we use it to build a live streaming video application by integrating RocketBufs into a web server to receive disseminated video data. When compared with the same application built with Redis, the RocketBufs-based dissemination host achieves live streaming throughput up to 73% higher while disseminating data, and the RocketBufs-based web server shows a reduction of up to 95% in CPU utilization, allowing for up to 55% more concurrent viewers to be serviced

    Datacenter Traffic Control: Understanding Techniques and Trade-offs

    Get PDF
    Datacenters provide cost-effective and flexible access to scalable compute and storage resources necessary for today's cloud computing needs. A typical datacenter is made up of thousands of servers connected with a large network and usually managed by one operator. To provide quality access to the variety of applications and services hosted on datacenters and maximize performance, it deems necessary to use datacenter networks effectively and efficiently. Datacenter traffic is often a mix of several classes with different priorities and requirements. This includes user-generated interactive traffic, traffic with deadlines, and long-running traffic. To this end, custom transport protocols and traffic management techniques have been developed to improve datacenter network performance. In this tutorial paper, we review the general architecture of datacenter networks, various topologies proposed for them, their traffic properties, general traffic control challenges in datacenters and general traffic control objectives. The purpose of this paper is to bring out the important characteristics of traffic control in datacenters and not to survey all existing solutions (as it is virtually impossible due to massive body of existing research). We hope to provide readers with a wide range of options and factors while considering a variety of traffic control mechanisms. We discuss various characteristics of datacenter traffic control including management schemes, transmission control, traffic shaping, prioritization, load balancing, multipathing, and traffic scheduling. Next, we point to several open challenges as well as new and interesting networking paradigms. At the end of this paper, we briefly review inter-datacenter networks that connect geographically dispersed datacenters which have been receiving increasing attention recently and pose interesting and novel research problems.Comment: Accepted for Publication in IEEE Communications Surveys and Tutorial
    • …
    corecore