7 research outputs found

    Towards fair, scalable, locking

    Get PDF
    Without care, Hardware Transactional Memory exhibits several pathologies that can degrade performance. Among them, writers of commonly read variables can suffer from starvation. Though different solutions have been proposed for HTM systems, hybrid systems can still suffer from this problem, given that software transactions do not interact with the mechanisms hardware uses to avoid starvation. In this paper we introduce a new per-directory-line hardware contention management mechanism that allows fairer access between software and hardware threads without the need to abort any transaction. Our mechanism is based on "reserving" directory lines, implementing a limited fair queue for the requests on each line. We adapt the mechanism to the LogTM conflict detection scheme and show that the resulting proposal is deadlock free. Finally, we sketch how the idea could be applied more generally to reader-writer locks.
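    The mechanism above lives in the coherence directory, so it has no direct software equivalent. Purely as a software analog of a limited fair queue that keeps readers from starving writers, here is a minimal sketch of a standard ticket-based reader-writer lock in C11 (not the paper's hardware design); FIFO ticket order plays the role of the per-line reservation queue, and the byte-sized counters bound the "queue" to fewer than 256 concurrent waiters.

```c
#include <stdatomic.h>

/* Ticket-based reader-writer lock: FIFO ticket order prevents writer
 * starvation. Byte counters bound the fair queue to <256 waiters. */
typedef struct {
    _Atomic unsigned char users;   /* next ticket to hand out */
    _Atomic unsigned char read;    /* ticket currently allowed to read */
    _Atomic unsigned char write;   /* ticket currently allowed to write */
} rwticket;

void rw_wrlock(rwticket *l) {
    unsigned char me = atomic_fetch_add(&l->users, 1);
    while (atomic_load(&l->write) != me) ;  /* spin until it is our turn */
}

void rw_wrunlock(rwticket *l) {
    atomic_fetch_add(&l->read, 1);          /* admit the next ticket holder, */
    atomic_fetch_add(&l->write, 1);         /* whether reader or writer */
}

void rw_rdlock(rwticket *l) {
    unsigned char me = atomic_fetch_add(&l->users, 1);
    while (atomic_load(&l->read) != me) ;   /* spin until it is our turn */
    atomic_fetch_add(&l->read, 1);          /* let a following reader in too */
}

void rw_rdunlock(rwticket *l) {
    atomic_fetch_add(&l->write, 1);         /* one step toward the next writer */
}
```

    A writer that arrives after a batch of readers gets the next ticket, so later readers queue behind it instead of overtaking it, which is exactly the starvation pattern the paper targets.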

    Preemption Adaptivity in Time-Published Queue-Based Spin Locks

    No full text
    The proliferation of multiprocessor servers and multithreaded applications has increased the demand for high-performance synchronization. Traditional scheduler-based locks incur the overhead of a full context switch between threads and are thus unacceptably slow for many applications. Spin locks offer low overhead, but they either scale poorly on large-scale SMPs (test-and-set style locks) or behave poorly in the presence of preemption (queue-based locks). Previous work has shown how to build preemption-tolerant locks using an extended kernel interface, but such locks are neither portable to nor even compatible with most operating systems. In this work, we propose a time-publishing heuristic in which each thread periodically records its current timestamp to a shared memory location. Given the high-resolution, roughly synchronized clocks of modern processors, this convention allows threads to guess accurately which peers are active based on the currency of their timestamps. We implement two queue-based locks, MCS-TP and CLH-TP, and evaluate their performance relative to both traditional spin locks and preemption-safe locks on a 32-processor IBM p690 multiprocessor. Experimental results indicate that time-published locks make it feasible, for the first time, to use queue-based spin locks on multiprogrammed systems with a standard kernel interface.
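    A minimal sketch of the time-publishing heuristic itself, in C11; the full MCS-TP and CLH-TP protocols build timeout, node-reuse, and queue-repair logic on top of this, and the PATIENCE threshold and helper names here are illustrative, not from the paper.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define PATIENCE_NS 50000000ULL         /* illustrative: ~50 ms staleness bound */

typedef struct {
    _Atomic uint64_t last_seen;         /* timestamp this thread republishes */
} tp_slot;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Called by a waiting thread on each spin iteration: publish liveness. */
static void tp_publish(tp_slot *me) {
    atomic_store_explicit(&me->last_seen, now_ns(), memory_order_relaxed);
}

/* Called by the lock holder before handing off: a fresh timestamp suggests
 * the waiter is running; a stale one suggests it has been preempted, so the
 * holder should skip it rather than grant it a lock it cannot use. */
static int tp_likely_active(const tp_slot *peer) {
    uint64_t seen = atomic_load_explicit(&peer->last_seen, memory_order_relaxed);
    return now_ns() - seen <= PATIENCE_NS;
}
```

    In the locks themselves, each queue node carries such a slot: waiters call tp_publish while spinning, and the release path grants the lock to the first waiter for which tp_likely_active holds, skipping apparently preempted ones.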


    High performance cloud computing on multicore computers

    Get PDF
    The cloud has become a major computing platform, with virtualization being a key to allowing applications to run and share the resources in the cloud. A wide spectrum of applications need to process large amounts of data at high speeds in the cloud, e.g., analyzing customer data to find out purchase behavior, processing location data to determine geographical trends, or mining social media data to assess brand sentiment. To achieve high performance, these applications create and use multiple threads running on multicore processors. However, existing virtualization technology cannot support the efficient execution of such applications on virtual machines, making them suffer poor and unstable performance in the cloud. Targeting multi-threaded applications, the dissertation analyzes and diagnoses their performance issues on virtual machines, and designs practical solutions to improve their performance. The dissertation makes the following contributions. First, it conducts extensive experiments with standard multicore applications in order to evaluate the performance overhead of virtualization systems and diagnose the factors causing it. Second, focusing on one main source of the performance overhead, excessive spinning, it designs and evaluates a holistic solution that makes effective use of the hardware virtualization support in processors to reduce excessive spinning at low cost. Third, focusing on application scalability, the most important performance feature for multi-threaded applications, it models application scalability in virtual machines and analyzes how scalability changes with virtualization and resource sharing. Based on the modeling and analysis, it identifies key application features and system factors that impact application scalability, and reveals possible approaches for improvement. Fourth, it explores one approach to improving application scalability by making full utilization of the virtual resources of each virtual machine. The general idea is to match the workload distribution among the virtual CPUs in a virtual machine with the virtual CPU resources allocated by the virtual machine manager.
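    The dissertation's spinning fix relies on processor virtualization support, which cannot be reproduced in a few lines. Purely as an illustration of the excessive-spinning problem it targets, the sketch below shows a common software-side mitigation in C11, a spin-then-yield lock that stops burning a virtual CPU's timeslice when the lock holder has likely been descheduled; the names and the spin budget are illustrative.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_BUDGET 1024                /* illustrative bound on busy-waiting */

typedef _Atomic int vspinlock;          /* 0 = free, 1 = held */

void vspin_lock(vspinlock *l) {
    for (;;) {
        for (int i = 0; i < SPIN_BUDGET; i++) {
            /* test-and-test-and-set: read first to avoid cache-line pinging */
            if (!atomic_load_explicit(l, memory_order_relaxed) &&
                !atomic_exchange_explicit(l, 1, memory_order_acquire))
                return;
        }
        /* Holder's vCPU is probably preempted: yield the physical CPU so the
         * scheduler (or hypervisor) can run someone else, ideally the holder. */
        sched_yield();
    }
}

void vspin_unlock(vspinlock *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```

    In a virtual machine, a pure spin lock can spin for an entire timeslice against a holder whose vCPU is not running at all; bounding the spin is the simplest way to cap that waste, whereas the dissertation pursues the same goal through the processor's virtualization support.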

    Fast and Portable Locking for Multicore Architectures

    Get PDF
    The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server's cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.
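    A minimal single-lock sketch of the RCL idea in C11, with illustrative names and a fixed client table; the real RCL runtime multiplexes many locks per server thread and handles blocking inside critical sections, none of which is shown here.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CLIENTS 64

typedef struct {
    void (*_Atomic fn)(void *);   /* critical section to run; NULL = slot empty */
    void *arg;                    /* argument for the critical section */
} rcl_slot;                       /* one padded cache line each, in a real impl */

static rcl_slot slots[MAX_CLIENTS];

/* Dedicated server hardware thread: runs everyone's critical sections, so the
 * lock-protected data stays hot in this core's cache. */
void rcl_server_loop(void) {
    for (;;)
        for (int i = 0; i < MAX_CLIENTS; i++) {
            void (*fn)(void *) =
                atomic_load_explicit(&slots[i].fn, memory_order_acquire);
            if (fn) {
                fn(slots[i].arg);                       /* execute remotely */
                atomic_store_explicit(&slots[i].fn, NULL,
                                      memory_order_release);
            }
        }
}

/* Client side: a lock acquisition becomes a remote procedure call. */
void rcl_execute(int my_id, void (*fn)(void *), void *arg) {
    slots[my_id].arg = arg;       /* written before the release store below */
    atomic_store_explicit(&slots[my_id].fn, fn, memory_order_release);
    while (atomic_load_explicit(&slots[my_id].fn, memory_order_acquire))
        ;                         /* spin until the server clears the slot */
}
```

    Because one server thread serializes all requests for the lock, no client ever contends on a lock word, and the release/acquire pairs make the client's argument visible to the server and the critical section's effects visible back to the client.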

    Towards Scalable Synchronization on Multi-Cores

    Get PDF
    The shift of commodity hardware from single- to multi-core processors in the early 2000s compelled software developers to take advantage of the available parallelism of multi-cores. Unfortunately, only a few (so-called embarrassingly parallel) applications can leverage this parallelism in a straightforward manner. The remaining applications require that their processes coordinate their possibly interleaved executions to ensure overall correctness; they require synchronization. Synchronization is achieved by constraining or even prohibiting parallel execution. Thus, per Amdahl's law, synchronization limits software scalability. In this dissertation, we explore how to minimize the effects of synchronization on software scalability. We show that scalability of synchronization is mainly a property of the underlying hardware. This means that synchronization directly hampers the cross-platform performance portability of concurrent software. Nevertheless, we can achieve portability without sacrificing performance by creating design patterns and abstractions that implicitly leverage hardware details without exposing them to software developers. We first perform an exhaustive analysis of the performance behavior of synchronization on several modern platforms. This analysis clearly shows that the performance and scalability of synchronization are highly dependent on the characteristics of the underlying platform. We then focus on lock-based synchronization and analyze the energy/performance trade-offs of various waiting techniques. We show that the performance and the energy efficiency of locks go hand in hand on modern x86 multi-cores. This correlation is again due to the characteristics of the hardware, which does not provide practical tools for reducing the power consumption of locks without sacrificing throughput. We then propose two approaches for developing portable and scalable concurrent software, hiding the limitations that the underlying multi-cores impose. First, we introduce OPTIK, a new practical design pattern for designing and implementing fast and scalable concurrent data structures. We illustrate the power of the OPTIK pattern by devising five new algorithms and by optimizing four state-of-the-art algorithms for linked lists, skip lists, hash tables, and queues. Second, we introduce MCTOP, a multi-core topology abstraction which includes low-level information such as memory bandwidths. MCTOP enables developers to accurately and portably define high-level optimization policies. We illustrate several such policies through four examples, including automated backoff schemes for locks, and demonstrate the performance and portability of these policies on five platforms.
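    The OPTIK pattern is described abstractly above; a minimal sketch of the version-based lock at its core might look as follows in C11 (names assumed, not the dissertation's exact API). The key point is that validating the version observed during an optimistic traversal and acquiring the lock collapse into a single compare-and-swap.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint64_t optik_t;   /* even = unlocked version, odd = locked */

static inline uint64_t optik_read(optik_t *l) {
    return atomic_load_explicit(l, memory_order_acquire);
}

/* Acquire only if the lock still holds the version we saw while traversing:
 * validation and locking in one CAS. On failure, restart the operation. */
static inline bool optik_trylock_version(optik_t *l, uint64_t seen) {
    if (seen & 1)                   /* was locked when observed: data may move */
        return false;
    return atomic_compare_exchange_strong(l, &seen, seen + 1);
}

static inline void optik_unlock(optik_t *l) {
    /* back to even, at a new version, so concurrent readers revalidate */
    atomic_fetch_add_explicit(l, 1, memory_order_release);
}
```

    An update on, say, a linked list would read the version, traverse without locking, call optik_trylock_version with the remembered version, and simply restart if the CAS fails because the structure changed in the meantime.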