
    Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory

    Previous work [1] has claimed that the best-performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache locality and synchronization-free parallelism in the build phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle-loaded in the form of transactions. In such cases, the first benefit of partitioning, increased locality, is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well. Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower.
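
    The core mechanism is easy to sketch. Below is a minimal, illustrative version of an HTM-protected build-phase insert with abort-rate monitoring, written against Intel's RTM intrinsics; the table layout, retry budget, and the 10% threshold are assumptions for illustration, not the authors' code.

```cpp
// Sketch only: HTM-protected chaining insert for the hash-join build phase,
// with cheap abort-rate monitoring. Illustrative assumptions throughout;
// requires a TSX-capable CPU and compilation with -mrtm.
#include <immintrin.h>
#include <atomic>
#include <cstdint>
#include <vector>

struct Node { uint64_t key, payload; Node* next; };

std::vector<Node*> buckets(1 << 20);       // build-side hash table (assumed size)
std::atomic<bool> fallback_busy{false};    // set while the fallback path runs
std::atomic<uint64_t> started{0}, aborted{0};

void build_insert(Node* n, size_t idx) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        started.fetch_add(1, std::memory_order_relaxed);
        if (_xbegin() == _XBEGIN_STARTED) {
            // Subscribe to the fallback flag so a fallback writer aborts us.
            if (fallback_busy.load(std::memory_order_relaxed)) _xabort(0x01);
            n->next = buckets[idx];
            buckets[idx] = n;
            _xend();
            return;
        }
        aborted.fetch_add(1, std::memory_order_relaxed);
    }
    // Non-speculative fallback for this one insert (the paper instead falls
    // back to radix-partitioning the whole build input when aborts pile up).
    while (fallback_busy.exchange(true, std::memory_order_acquire)) { }
    n->next = buckets[idx];
    buckets[idx] = n;
    fallback_busy.store(false, std::memory_order_release);
}

// The lightweight runtime monitor reduces to a ratio check, e.g.:
bool low_locality_detected() {
    uint64_t s = started.load(), a = aborted.load();
    return s > 100000 && 10 * a > s;       // assumed threshold: >10% aborts
}
```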

    Performance Comparison of Various STM Concurrency Control Protocols Using Synchrobench

    Writing concurrent programs for shared-memory multiprocessor systems is a nightmare, which hinders users from exploiting the full potential of multiprocessors. STM (Software Transactional Memory) is a promising concurrent programming paradigm that addresses the woes of programming for multiprocessor systems. In this paper, we implement the BTO (Basic Timestamp Ordering), SGT (Serialization Graph Testing), and MVTO (Multi-Version Timestamp Ordering) concurrency control protocols and build an STM library to evaluate their performance. The STM follows a deferred-write approach. A SET data structure is implemented using the transactions of our STM library, and this transactional SET serves as the test application for evaluating the STM. The performance of the protocols is rigorously compared against the linked-list module of the Synchrobench benchmark, which implements the SET data structure using a lazy-list, a lock-free list, a lock-coupling list, and ESTM (Elastic Software Transactional Memory). Our analysis shows that with more than 60 threads and a 70% update rate, BTO takes 17% to 29% and 6% to 24% less CPU time per thread than the lazy-list and lock-coupling list, respectively, while MVTO takes 13% to 24% and 3% to 24% less CPU time per thread against the same two lists. BTO and MVTO have similar per-thread CPU time, and both outperform SGT by 9% to 36%.
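
    As a concrete illustration of the simplest of the three protocols, the following is a minimal sketch of the BTO read/write rules with deferred writes; the item and transaction layouts are assumptions for illustration, not the interface of the authors' library.

```cpp
// Sketch of Basic Timestamp Ordering (BTO) with deferred writes. A single
// global mutex guards the timestamp metadata for brevity; a real STM uses
// fine-grained per-item locking. Illustrative, not the authors' library.
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <algorithm>

struct Item { uint64_t rts = 0, wts = 0; int value = 0; };

struct Txn {
    uint64_t ts;                                  // unique, assigned at begin
    std::unordered_map<Item*, int> write_buffer;  // deferred writes
};

std::mutex meta;   // brevity only; serializes all metadata updates

// Read rule: a transaction may not read an item already overwritten by a
// "later" transaction; otherwise it records itself as the latest reader.
bool bto_read(Txn& t, Item& x, int& out) {
    std::lock_guard<std::mutex> g(meta);
    if (t.ts < x.wts) return false;               // too late: abort txn
    x.rts = std::max(x.rts, t.ts);
    out = x.value;
    return true;
}

void bto_write(Txn& t, Item& x, int v) { t.write_buffer[&x] = v; }

// Write rule, applied at commit because writes are deferred: abort if a
// "later" transaction has already read or written any buffered item.
bool bto_commit(Txn& t) {
    std::lock_guard<std::mutex> g(meta);
    for (auto& [x, v] : t.write_buffer)
        if (t.ts < x->rts || t.ts < x->wts) return false;
    for (auto& [x, v] : t.write_buffer) { x->wts = t.ts; x->value = v; }
    return true;
}
```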

    Critical Sections: Re-Emerging Scalability Concerns for Database Storage Engines

    Critical sections in database storage engines impact performance and scalability ever more as the number of hardware contexts per chip continues to grow exponentially. With enough threads in the system, some critical section will eventually become a bottleneck. While algorithmic changes are the only long-term solution, they tend to be complex and costly to develop. Meanwhile, changes in how critical sections are enforced require much less effort. We observe that, in practice, many critical sections are so short that enforcing them contributes a significant or even dominating fraction of their total cost, and that tuning them directly improves database system performance. The contribution of this paper is twofold: we (a) make a thorough performance comparison of the various synchronization primitives in the database system developer's toolbox and highlight the best ones for practical use, and (b) show that properly enforcing critical sections can delay the need for algorithmic changes for a target number of processors.
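
    To make the trade-off concrete, here is a sketch of one primitive from that toolbox, a test-and-test-and-set spinlock with exponential backoff, whose acquire/release cost is small enough to matter for the very short critical sections the paper describes. It is an illustration, not necessarily the primitive the paper ends up recommending.

```cpp
// Test-and-test-and-set spinlock with exponential backoff: reads locally
// while the lock is held (avoiding cache-line ping-pong) and attempts the
// expensive atomic exchange only when the lock appears free. Sketch only.
#include <atomic>
#include <thread>

class TTASLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        int backoff = 1;
        for (;;) {
            while (locked.load(std::memory_order_relaxed)) { } // local spin
            if (!locked.exchange(true, std::memory_order_acquire))
                return;                                        // acquired
            for (int i = 0; i < backoff; ++i)                  // lost the race:
                std::this_thread::yield();                     // back off
            if (backoff < 1024) backoff <<= 1;
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```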

    Speedy Transactions in Multicore In-Memory Databases

    Silo is a new in-memory database that achieves excellent performance and scalability on modern multicore machines. Silo was designed from the ground up to use system memory and caches efficiently; for instance, it avoids all centralized contention points, including centralized transaction ID assignment. Silo's key contribution is a commit protocol based on optimistic concurrency control that provides serializability while avoiding all shared-memory writes for records that were only read. Though this might seem to complicate the enforcement of a serial order, correct logging and recovery are provided by linking periodically updated epochs with the commit protocol. Silo provides the same guarantees as any serializable database without unnecessary scalability bottlenecks or much additional latency. Silo achieves almost 700,000 transactions per second on a standard TPC-C workload mix on a 32-core machine, as well as near-linear scalability; considered per core, this is several times higher than previously reported results.
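
    The commit protocol can be sketched as three phases over per-record version words (TIDs): lock the write set, validate the read set, then install writes. The sketch below is a simplified illustration that omits epochs, the exact TID word layout, and deadlock avoidance; the names and details are assumptions, not Silo's source.

```cpp
// Simplified sketch of a Silo-style OCC commit. Reading a record performs
// no shared-memory write; at commit we lock the write set, check that every
// record read still carries the version observed during execution, and only
// then install the writes. Epoch-based recovery is omitted.
#include <atomic>
#include <cstdint>
#include <vector>

struct Record {
    std::atomic<uint64_t> tid{0};   // version word; low bit = lock (assumed)
    int value = 0;
};

struct ReadEntry  { Record* rec; uint64_t seen_tid; };  // version seen at read
struct WriteEntry { Record* rec; int new_value; };

// new_tid must be even (lock bit clear) and exceed all observed versions.
bool commit(std::vector<ReadEntry>& reads,
            std::vector<WriteEntry>& writes, uint64_t new_tid) {
    size_t locked = 0;
    bool ok = true;
    for (; locked < writes.size(); ++locked) {          // 1: lock write set
        uint64_t v = writes[locked].rec->tid.load();
        if ((v & 1) ||
            !writes[locked].rec->tid.compare_exchange_strong(v, v | 1)) {
            ok = false;                                 // writes[locked] not
            break;                                      // locked by us
        }
    }
    for (size_t i = 0; ok && i < reads.size(); ++i) {   // 2: validate reads
        uint64_t cur = reads[i].rec->tid.load();
        ok = (cur & ~1ull) == reads[i].seen_tid;        // version unchanged?
        // (real Silo also aborts if another transaction holds the lock)
    }
    if (ok) {
        for (auto& w : writes) {                        // 3: install writes,
            w.rec->value = w.new_value;                 //    releasing locks
            w.rec->tid.store(new_tid, std::memory_order_release);
        }
    } else {
        for (size_t i = 0; i < locked; ++i) {           // abort: undo locks
            Record* r = writes[i].rec;
            r->tid.store(r->tid.load() & ~1ull, std::memory_order_release);
        }
    }
    return ok;
}
```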

    Applying HTM to an OLTP System: No Free Lunch

    Transactional memory is a promising way to implement efficient synchronization mechanisms for multicore processors. Intel's introduction of hardware transactional memory (HTM) into its Haswell line of processors marks an important step toward mainstream availability of transactional memory. Transaction processing systems require the execution of dozens of critical sections to ensure isolation among threads, which makes them a natural target for exploiting HTM. In this study, we quantify the opportunities and limitations of directly applying HTM to an existing OLTP system that uses fine-grained synchronization. Our target is Shore-MT, a modern multithreaded transactional storage manager that uses a variety of fine-grained synchronization mechanisms to provide scalability on multicore processors. We find that HTM can improve performance of the TATP workload by 13-17% when applied judiciously. However, attempting to replace all synchronization reduces performance below the baseline due to the high percentage of aborts caused by the limitations of the current HTM implementation.
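
    The generic technique being applied here is speculative lock elision: execute the critical section as a hardware transaction while merely reading the lock word, and fall back to actually acquiring the lock after a few aborts. A minimal sketch follows (not Shore-MT's code; the retry budget is an assumption).

```cpp
// Generic RTM-based lock elision. Transactions only read `held`, so
// non-conflicting critical sections run concurrently; a thread that takes
// the lock for real aborts every in-flight transaction that subscribed to
// it. Requires Intel TSX and -mrtm. Sketch only.
#include <immintrin.h>
#include <atomic>

class ElidedLock {
    std::atomic<bool> held{false};
    static constexpr int kRetries = 3;   // assumed retry budget
public:
    void lock() {
        for (int i = 0; i < kRetries; ++i) {
            if (_xbegin() == _XBEGIN_STARTED) {
                if (!held.load(std::memory_order_relaxed))
                    return;              // speculate: lock appears free
                _xabort(0x01);           // real owner active: abort, retry
            }
        }
        // Fallback: acquire for real (aborts concurrent transactions).
        while (held.exchange(true, std::memory_order_acquire))
            while (held.load(std::memory_order_relaxed)) { }
    }
    void unlock() {
        if (held.load(std::memory_order_relaxed))
            held.store(false, std::memory_order_release); // real release
        else
            _xend();                     // commit the elided section
    }
};
```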

    Decoupling contention management from scheduling

    Many parallel applications exhibit unpredictable communication between threads, leading to contention for shared objects. The choice of contention management strategy strongly impacts the performance and scalability of these applications: spinning provides maximum performance but wastes significant processor resources, while blocking-based approaches conserve processor resources but introduce high overheads on the critical path of computation. Under high or changing load, the operating system complicates matters further with arbitrary scheduling decisions that often preempt lock holders, leading to long serialization delays until the preempted thread resumes execution. We observe that contention management is orthogonal to the problems of scheduling and load management and propose to decouple them so that each may be solved independently and effectively. To this end, we propose a load control mechanism that manages the number of active threads in the system separately from any contention that may exist. By isolating contention management from damaging interactions with the OS scheduler, we combine the efficiency of spinning with the robustness of blocking. The proposed load control mechanism delivers stable, high performance for both lightly and heavily loaded systems, requires no special privileges or modifications at the OS level, and can be implemented as a library that benefits existing code.
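
    A minimal sketch of the idea, assuming a fixed admission limit (the paper's controller adjusts the number of active threads dynamically): admission is handled by a blocking semaphore, while admitted threads synchronize with pure spinning, so a lock holder always has a hardware context and makes progress.

```cpp
// Decoupled load control, sketched: a counting semaphore admits at most
// one worker per hardware context (blocking the rest), and admitted threads
// use cheap spin-only locks inside. Illustrative only (C++20).
#include <semaphore>
#include <atomic>
#include <thread>
#include <algorithm>

std::counting_semaphore<1024> admission{
    std::max<std::ptrdiff_t>(1, std::thread::hardware_concurrency())};

std::atomic_flag short_lock = ATOMIC_FLAG_INIT;  // spin-only critical section

void worker_iteration() {
    admission.acquire();                 // load control: block if over limit
    while (short_lock.test_and_set(std::memory_order_acquire)) { } // spin
    // ... short critical section: spinning is safe here because every
    // admitted thread is running and holders are rarely preempted ...
    short_lock.clear(std::memory_order_release);
    admission.release();
}
```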

    Towards Scalable Synchronization on Multi-Cores

    The shift of commodity hardware from single- to multi-core processors in the early 2000s compelled software developers to take advantage of the available parallelism of multi-cores. Unfortunately, only a few (so-called embarrassingly parallel) applications can leverage this available parallelism in a straightforward manner. The remaining (non-embarrassingly parallel) applications require that their processes coordinate their possibly interleaved executions to ensure overall correctness; in other words, they require synchronization. Synchronization is achieved by constraining or even prohibiting parallel execution; thus, per Amdahl's law, synchronization limits software scalability. In this dissertation, we explore how to minimize the effects of synchronization on software scalability. We show that the scalability of synchronization is mainly a property of the underlying hardware, which means that synchronization directly hampers the cross-platform performance portability of concurrent software. Nevertheless, we can achieve portability without sacrificing performance by creating design patterns and abstractions that implicitly leverage hardware details without exposing them to software developers. We first perform an exhaustive analysis of the performance behavior of synchronization on several modern platforms. This analysis clearly shows that the performance and scalability of synchronization are highly dependent on the characteristics of the underlying platform. We then focus on lock-based synchronization and analyze the energy/performance trade-offs of various waiting techniques. We show that the performance and the energy efficiency of locks go hand in hand on modern x86 multi-cores; this correlation is again due to the hardware, which provides no practical tools for reducing the power consumption of locks without sacrificing throughput. We then propose two approaches for developing portable and scalable concurrent software, hiding the limitations that the underlying multi-cores impose. First, we introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. We illustrate the power of the OPTIK pattern by devising five new algorithms and by optimizing four state-of-the-art algorithms for linked lists, skip lists, hash tables, and queues. Second, we introduce MCTOP, a multi-core topology abstraction that includes low-level information such as memory bandwidths. MCTOP enables developers to accurately and portably define high-level optimization policies. We illustrate several such policies through four examples, including automated backoff schemes for locks, and demonstrate the performance and portability of these policies on five platforms.
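
    As a taste of the pattern, here is a sketch of the OPTIK lock idea: validating an optimistic snapshot and acquiring the lock collapse into a single compare-and-swap on a version word. This is the abstract pattern only, not the dissertation's exact interface.

```cpp
// Sketch of the OPTIK pattern: a version lock where an even value means
// unlocked, and lock acquisition succeeds only if the version still equals
// the snapshot taken before the optimistic traversal. Illustrative only.
#include <atomic>
#include <cstdint>

using optik_t = std::atomic<uint64_t>;

uint64_t optik_read(optik_t& l) {           // snapshot before optimistic work
    return l.load(std::memory_order_acquire);
}

// Validation and locking in one CAS: succeeds only if the version is still
// `seen` (nothing changed since the snapshot) and even (nobody holds it).
bool optik_trylock_version(optik_t& l, uint64_t seen) {
    if (seen & 1) return false;             // was locked at snapshot time
    return l.compare_exchange_strong(seen, seen + 1,
                                     std::memory_order_acquire);
}

void optik_unlock(optik_t& l) {             // bump to the next even version
    l.fetch_add(1, std::memory_order_release);
}

// Typical use in a concurrent data-structure operation:
//   uint64_t v = optik_read(node_lock);
//   /* ... traverse / compute optimistically, holding no locks ... */
//   if (!optik_trylock_version(node_lock, v)) { /* restart operation */ }
//   /* ... apply the update ... */
//   optik_unlock(node_lock);
```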