
    On the Impact of Memory Allocation on High-Performance Query Processing

    Somewhat surprisingly, the behavior of analytical query engines is crucially affected by the dynamic memory allocator used. Memory allocators strongly influence performance, scalability, memory efficiency, and memory fairness to other processes. In this work, we provide the first comprehensive experimental analysis of the impact of memory allocation on high-performance query engines. We test five state-of-the-art dynamic memory allocators and discuss their strengths and weaknesses within our DBMS. The right allocator can increase the performance of TPC-DS (SF 100) by 2.7x on a 4-socket Intel Xeon server.
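
    The simplest way to reproduce this kind of comparison is to keep the workload fixed and swap the allocator underneath it. Below is a minimal, illustrative multi-threaded allocation stress test (not code from the paper); running it once with the default allocator and again with an alternative allocator preloaded exposes the scalability differences the study describes. The allocator choice and library path in the comment are assumptions for illustration.

```cpp
// Minimal allocator stress sketch: run it unchanged, then again with an
// alternative allocator preloaded, e.g.
//   LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so ./alloc_bench
// The path and allocator choice are illustrative, not taken from the paper.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

static void worker(std::size_t iters) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<std::size_t> size(16, 4096);  // small, mixed sizes
    std::vector<void*> live;
    live.reserve(1024);
    for (std::size_t i = 0; i < iters; ++i) {
        live.push_back(::operator new(size(rng)));
        if (live.size() == 1024) {                        // free in bursts, as
            for (void* p : live) ::operator delete(p);    // query pipelines tend to
            live.clear();
        }
    }
    for (void* p : live) ::operator delete(p);
}

int main() {
    const unsigned threads = std::thread::hardware_concurrency();
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) pool.emplace_back(worker, 1'000'000);
    for (auto& th : pool) th.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%u threads, %lld ms\n", threads, static_cast<long long>(ms));
}
```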

    Growth of relational model: Interdependence and complementary to big data

    A database management system is a long-established application of computer science that provides a platform for the creation, movement, and use of voluminous data. The area has witnessed a series of developments and technological advancements, from the conventional structured database to the recent buzzword, big data. This paper aims to provide a complete model of the relational database, which is still widely used because of its well-known ACID properties, namely atomicity, consistency, isolation, and durability. Specifically, the objective of this paper is to highlight the adoption of relational model approaches by big data techniques. Towards addressing the reason for this incorporation, this paper qualitatively studies the advancements made over time to the relational data model. First, the variations in the data storage layout are illustrated based on the needs of the application. Second, fast data retrieval techniques such as indexing, query processing, and concurrency control methods are presented. The paper provides vital insights to appraise the efficiency of the structured database in the unstructured environment, particularly when both consistency and scalability become an issue in the working of a hybrid transactional and analytical database management system.

    Scaling In-Memory databases on multicores

    Current computer systems have evolved from featuring only a single processing unit and limited RAM, on the order of kilobytes or a few megabytes, to include several multicore processors, offering on the order of several tens of concurrent execution contexts, and main memory on the order of several tens to hundreds of gigabytes. This allows all the data of many applications to be kept in main memory, leading to the development of in-memory databases. Compared to disk-backed databases, in-memory databases (IMDBs) are expected to provide better performance by incurring less I/O overhead. In this dissertation, we present a scalability study of two general-purpose IMDBs on multicore systems. The results show that current general-purpose IMDBs do not scale on multicores, due to contention among threads running concurrent transactions. In this work, we explore different directions to overcome the scalability issues of IMDBs on multicores, while enforcing strong isolation semantics. First, we present a solution that requires no modification to either the database system or the applications, called MacroDB. MacroDB replicates the database among several engines, using a master-slave replication scheme, where update transactions execute on the master, while read-only transactions execute on the slaves. This reduces contention, allowing MacroDB to offer scalable performance under read-only workloads, while update-intensive workloads suffer a performance loss when compared to the standalone engine. Second, we delve into the database engine and identify the concurrency control mechanism used by the storage sub-component as a scalability bottleneck. We then propose a new locking scheme that allows the removal of such mechanisms from the storage sub-component. This modification offers a performance improvement under all workloads, when compared to the standalone engine, while scalability is limited to read-only workloads. Next, we address the scalability limitations for update-intensive workloads and propose reducing the locking granularity from the table level to the attribute level. This further improves performance for intensive and moderate update workloads, at a slight cost for read-only workloads, while scalability is limited to read-intensive and read-only workloads. Finally, we investigate the impact applications have on the performance of database systems by studying how the order of operations inside transactions influences database performance. We then propose a Read-before-Write (RbW) interaction pattern, under which transactions perform all read operations before executing write operations. The RbW pattern allowed TPC-C to achieve scalable performance on our modified engine for all workloads. Additionally, the RbW pattern allowed our modified engine to achieve scalable performance on multicores, almost up to the total number of cores, while enforcing strong isolation.
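
    The last contribution lends itself to a tiny illustration. The sketch below contrasts an interleaved transaction with its Read-before-Write (RbW) equivalent, in which every read happens before any write; the toy account table and transfer logic are invented for illustration and are not code from the dissertation.

```cpp
// Toy illustration of the Read-before-Write (RbW) pattern: the transaction
// gathers every value it needs first, computes, and only then issues its
// writes, so the engine can take write locks in one late, short phase.
#include <string>
#include <unordered_map>

using Table = std::unordered_map<std::string, long>;

// Interleaved version: read, write, read, write ...
void transfer_interleaved(Table& accounts, const std::string& from,
                          const std::string& to, long amount) {
    long src = accounts[from];      // read
    accounts[from] = src - amount;  // write
    long dst = accounts[to];        // read
    accounts[to] = dst + amount;    // write
}

// RbW version: all reads up front, all writes at the end.
void transfer_rbw(Table& accounts, const std::string& from,
                  const std::string& to, long amount) {
    long src = accounts[from];      // reads
    long dst = accounts[to];
    accounts[from] = src - amount;  // writes
    accounts[to]   = dst + amount;
}
```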

    Scalable and dynamically balanced shared-everything OLTP with physiological partitioning

    Scaling the performance of shared-everything transaction processing systems to highly parallel multicore hardware remains a challenge for database system designers. Recent proposals alleviate locking and logging bottlenecks in the system, leaving page latching as the next potential problem. To tackle the page latching problem, we propose physiological partitioning (PLP). PLP applies logical-only partitioning, maintaining the desired properties of shared-everything designs, and introduces a multi-rooted B+Tree index structure (MRBTree) that enables partitioning of the accesses at the physical page level. Logical partitioning and MRBTrees together ensure that all accesses to a given index page come from a single thread and, hence, can be entirely latch-free; an extended design makes heap page accesses thread-private as well. Moreover, MRBTrees offer an infrastructure for easy repartitioning and allow us to have a lightweight dynamic load balancing mechanism (DLB) on top of PLP. Profiling a PLP prototype running on different multicore machines shows that it acquires 85% and 68% fewer contentious critical sections, respectively, than an optimized conventional design and one based on logical-only partitioning. PLP also improves performance by up to almost 50% over the existing systems, while DLB enhances the system with rapid and robust behavior in both detecting and handling load imbalance.
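
    To make the core idea concrete, here is a minimal sketch of a multi-rooted index whose partitions are each owned by a single worker thread, so sub-tree accesses need no latches. The partition count, the hash routing, and the use of std::map are illustrative assumptions, not the MRBTree described in the paper.

```cpp
// Minimal sketch of the idea behind physiological partitioning (PLP): a
// "multi-rooted" index keeps one sub-tree per logical partition, and every
// partition is accessed by exactly one worker thread, so sub-tree accesses
// can be latch-free.
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct Partition {
    std::map<std::uint64_t, std::uint64_t> subtree;  // one root per partition
    // In a real engine this partition is pinned to a single worker thread.
};

class MultiRootedIndex {
public:
    explicit MultiRootedIndex(std::size_t n) : parts_(n) {}

    std::size_t route(std::uint64_t key) const {         // logical partitioning
        return key % parts_.size();                      // by hash/key range
    }
    void insert(std::uint64_t key, std::uint64_t rid) {   // called only by the
        parts_[route(key)].subtree[key] = rid;            // partition's owner
    }                                                      // thread -> latch-free
    bool lookup(std::uint64_t key, std::uint64_t& rid) const {
        const auto& t = parts_[route(key)].subtree;
        auto it = t.find(key);
        if (it == t.end()) return false;
        rid = it->second;
        return true;
    }
private:
    std::vector<Partition> parts_;
};
```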

    Transactions Chasing Scalability and Instruction Locality on Multicores

    For several decades, online transaction processing (OLTP) has been one of the main server applications driving innovation in the data management ecosystem and, in turn, in the database and computer architecture communities. Recent hardware trends oblige software to overcome two major challenges to systems scalability on modern multicore processors: (1) exploiting the abundant thread-level parallelism across cores and (2) taking advantage of the implicit parallelism within a core. The traditional design of OLTP systems, however, faces inherent scalability problems due to its tightly coupled components. In addition, OLTP cannot exploit the full capability of the micro-architectural resources of modern processors because of conventional scheduling decisions that ignore cache locality for transactions. As a result, today's commonly used server hardware remains largely underutilized, leading to a huge waste of hardware resources and energy. [...] In this thesis, we first identify the unbounded critical sections of traditional OLTP systems as the main enemy of thread-level parallelism. We design an alternative shared-everything system based on physiological partitioning (PLP) to eliminate the unbounded critical sections while providing an infrastructure for low-cost dynamic repartitioning and without introducing high-cost distributed transactions. Then, we demonstrate that L1 instruction cache stalls are the dominant factor leading to underutilization in commodity servers. However, we also observe that, independently of their high-level functionality, transactions running in parallel on a multicore system share a significant amount of common instructions. By adaptively spreading the execution of a transaction over multiple cores through thread migration, or by multiplexing transactions on one core, we enable both ample L1 instruction cache capacity for a transaction and reuse of common instructions across concurrent transactions. [...] As the hardware demands more from the software to exploit the complexity and parallelism it offers in the multicore era, this work would change the way we traditionally schedule transactions. Instead of viewing a transaction as a single big task, we split it into smaller parts that can exploit data and instruction locality through careful dynamic scheduling decisions. The methods this thesis presents are not specific to OLTP systems alone; they can also benefit other types of applications that have concurrent requests executing a series of actions from a predefined set and face similar scalability problems on emerging hardware.
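
    A rough sketch of the scheduling idea, reduced to its simplest form: concurrent transactions that execute the same code path are funneled to the same worker thread, so they can reuse the instructions already resident in that core's L1 instruction cache. The queue structure and the notion of a transaction "type" are assumptions made for illustration; a real system would additionally pin each worker to a core and split transactions into finer-grained parts.

```cpp
// Per-type worker queues: same-type transactions run back to back on one
// worker, sharing a warm instruction cache. Illustrative only.
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class TypedScheduler {
public:
    explicit TypedScheduler(std::size_t types) : queues_(types) {
        for (std::size_t t = 0; t < types; ++t)
            workers_.emplace_back([this, t] { drain(t); });  // would be pinned
    }                                                        // to a core
    void submit(std::size_t type, std::function<void()> txn) {
        auto& q = queues_[type];
        { std::lock_guard<std::mutex> g(q.m); q.work.push_back(std::move(txn)); }
        q.cv.notify_one();
    }
    ~TypedScheduler() {
        for (auto& q : queues_) { std::lock_guard<std::mutex> g(q.m); q.stop = true; q.cv.notify_one(); }
        for (auto& w : workers_) w.join();
    }
private:
    struct Queue {
        std::mutex m; std::condition_variable cv;
        std::deque<std::function<void()>> work; bool stop = false;
    };
    void drain(std::size_t t) {
        auto& q = queues_[t];
        for (;;) {
            std::unique_lock<std::mutex> lk(q.m);
            q.cv.wait(lk, [&] { return q.stop || !q.work.empty(); });
            if (q.work.empty()) return;                // stop requested, drained
            auto job = std::move(q.work.front()); q.work.pop_front();
            lk.unlock();
            job();   // same-type transactions reuse the warm instruction cache
        }
    }
    std::deque<Queue> queues_;
    std::vector<std::thread> workers_;
};

int main() {
    TypedScheduler sched(2);                 // e.g. two transaction types
    for (int i = 0; i < 4; ++i)
        sched.submit(i % 2, [i] { std::printf("txn %d\n", i); });
    // destructor drains the queues and joins the workers
}
```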

    Micro-architectural analysis of in-memory OLTP: Revisited

    The micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main memory and can therefore omit the buffer pool. Furthermore, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases, since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing it and contrasting it with the behavior of disk-based systems when running OLTP workloads. The results show that, despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP: more than half of the execution time goes to memory stalls, where instruction cache misses or long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall execution time. Even though ground-up designed in-memory systems can eliminate instruction cache misses, the reduction in instruction stalls amplifies the impact of LLC data misses. As a result, only 30% of the CPU cycles are used to retire instructions, and 70% of the CPU cycles are wasted on stalls, for both traditional disk-based and new-generation in-memory OLTP.
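
    The headline numbers above amount to simple cycle accounting. The sketch below, with made-up counter values rather than measurements from the paper, shows how the retiring versus stalled split is derived once the stall cycles attributed to instruction-cache misses and LLC data misses are known.

```cpp
// Back-of-the-envelope cycle breakdown: retiring share is what remains after
// subtracting the stall cycles. Counter values are hypothetical; real numbers
// would come from hardware performance counters.
#include <cstdio>

int main() {
    const double total_cycles    = 1.0e9;   // hypothetical measurement
    const double icache_stalls   = 0.30e9;  // front-end (L1-I miss) stalls
    const double llc_data_stalls = 0.40e9;  // long-latency data-miss stalls
    const double retiring = total_cycles - icache_stalls - llc_data_stalls;
    std::printf("retiring: %.0f%%, stalled: %.0f%%\n",
                100.0 * retiring / total_cycles,
                100.0 * (icache_stalls + llc_data_stalls) / total_cycles);
}
```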

    Towards Scalable OLTP Over Fast Networks

    Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce. These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing. High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine. Instead, distributed OLTP database management systems (DBMSs) are deployed across multiple powerful machines. Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s. However, fast networks challenge the conventional belief that network communication is the main bottleneck. In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network. RDMA's primitives allow direct access to the memory of a remote machine at latencies within an order of magnitude of local memory access. This development invalidates the notion that network communication is the primary bottleneck. Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems. This thesis focuses on the challenges RDMA presents and its implications for the design of distributed OLTP systems. First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems. Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks. The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching). With the introduction of RDMA, the landscape of data access has undergone a significant transformation. This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies. Thus, as the second contribution, we design RDMA-optimized tree-based indexes, especially applicable to disaggregated databases that access remote data efficiently. We then turn our attention to the unique challenges of RDMA. One-sided RDMA, one of the network primitives introduced by RDMA, offers a performance advantage by enabling remote memory access while bypassing the remote CPU and the operating system. This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, it also means that specialized one-sided RDMA synchronization primitives are required, since the traditional CPU-driven primitives are bypassed. We found that existing one-sided RDMA synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption. As our third contribution, we address this issue by offering guidelines for building scalable and correct one-sided RDMA synchronization primitives. Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage. By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes. Central to our approach is a distributed caching protocol that dynamically caches data. With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently.
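
    To illustrate what a one-sided synchronization primitive looks like, here is a minimal sketch of a remote spinlock acquired with a remote compare-and-swap, so the machine holding the lock word never spends CPU on it. rdma_compare_and_swap() is a hypothetical placeholder for an RNIC atomic operation (e.g. an IBV_WR_ATOMIC_CMP_AND_SWP work request in the verbs API) and is stubbed with a local atomic so the sketch compiles; real implementations, and the correctness and fairness pitfalls the thesis addresses, are considerably more involved.

```cpp
// One-sided remote spinlock sketch: acquire via remote CAS, release via a
// plain remote write. The "remote" memory is simulated locally here.
#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<std::uint64_t> remote_lock_word{0};  // stands in for remote memory

// Placeholder: issue a remote CAS and return the value previously stored.
std::uint64_t rdma_compare_and_swap(std::uint64_t expected, std::uint64_t desired) {
    std::uint64_t prev = expected;
    remote_lock_word.compare_exchange_strong(prev, desired);
    return prev;  // RNIC atomics likewise return the original remote value
}

void lock_remote() {
    // Spin until the CAS observes 0 (unlocked) and installs 1 (locked).
    while (rdma_compare_and_swap(/*expected=*/0, /*desired=*/1) != 0)
        std::this_thread::yield();  // in practice: bounded backoff / retry
}

void unlock_remote() {
    // A plain RDMA write of 0 releases the lock; scalability and correctness
    // of such schemes are exactly what the thesis's guidelines address.
    remote_lock_word.store(0);
}
```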

    High Performance Transaction Processing on Non-Uniform Hardware Topologies

    Transaction processing is a mission-critical enterprise application that runs on high-end servers. Traditionally, transaction processing systems have been designed for uniform core-to-core communication latencies. In the past decade, with the emergence of multisocket multicores, we have Islands for the first time, i.e., groups of cores that communicate fast among themselves and more slowly with other groups. In current mainstream servers, each multicore processor corresponds to an Island. As the number of cores on a chip increases, however, we expect that multiple Islands will form within a single processor in the near future. In addition, the access latencies to local memory and to the memory of another server over a fast interconnect are converging, thus creating a hierarchy of Islands within a group of servers. Non-uniform hardware topologies pose a significant challenge to the scalability and the predictability of performance of transaction processing systems. Distributed transaction processing systems can alleviate this problem; however, no single deployment configuration is optimal for all workloads and hardware topologies. In order to fully utilize the available processing power, a transaction processing system needs to adapt to the underlying hardware topology and tune its configuration to the current workload. More specifically, the system should be able to detect any changes to the workload and hardware topology, and adapt accordingly without disrupting processing. In this thesis, we first systematically quantify the impact of hardware Islands on deployment configurations of distributed transaction processing systems. We show that none of these configurations is optimal for all workloads, and that the choice of the optimal configuration depends on the combination of the workload and hardware topology. In the cluster setting, on the other hand, the choice of optimal configuration additionally depends on the properties of the communication channel between the servers. We address this challenge by designing a dynamic shared-everything system that adapts its data structures automatically to hardware Islands. To ensure good performance in the presence of shifting workload patterns, we use a lightweight partitioning and placement mechanism to balance the load and minimize the synchronization overheads across Islands. Overall, we show that masking the non-uniformity of inter-core communication is critical for achieving predictably high performance for latency-sensitive applications, such as transaction processing. With clusters of a handful of multicore chips with large main memories replacing high-end many-socket servers, the deployment rules of thumb identified in our analysis have the potential to significantly reduce the synchronization and communication costs of transaction processing. As workloads become more dynamic and diverse, while still running on partitioned infrastructure, the lightweight monitoring and adaptive repartitioning mechanisms proposed in this thesis will be applicable to a wide range of designs for which traditional offline schemes are impractical.
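
    As a small illustration of the adaptivity described above, the sketch below discovers a machine's hardware Islands by grouping cores by their NUMA node using libnuma (link with -lnuma). Equating Islands with NUMA nodes is a simplifying assumption, and the actual worker placement and repartitioning policy are omitted.

```cpp
// Discover "Islands" (here: NUMA nodes) and the cores belonging to each, as
// the first step of topology-aware thread and data placement.
#include <numa.h>
#include <cstdio>
#include <map>
#include <thread>
#include <vector>

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
    const unsigned cpus = std::thread::hardware_concurrency();
    std::map<int, std::vector<unsigned>> islands;   // node id -> its cores
    for (unsigned cpu = 0; cpu < cpus; ++cpu)
        islands[numa_node_of_cpu(static_cast<int>(cpu))].push_back(cpu);
    for (const auto& [node, cores] : islands)
        std::printf("island %d: %zu cores\n", node, cores.size());
    // A topology-aware engine would now pin one worker group per island and
    // partition its shared-everything data structures along the same lines.
}
```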