3 research outputs found

    ForestGOMP: an efficient OpenMP environment for NUMA architectures

    Get PDF
    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data across the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing-based balancing strategies and next-touch-based data distribution policies. These techniques also provide insight into additional optimizations.
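
    To make the affinity idea above concrete, the following sketch co-locates each OpenMP thread with the data partition it initializes, using plain libnuma calls. It is not the ForestGOMP runtime or its interface; the chunk size and the naive thread-to-node mapping are illustrative assumptions.

```c
/* Minimal sketch (not the ForestGOMP API): co-locate each OpenMP thread
 * with the data partition it works on, using plain libnuma calls.
 * Compile with:  gcc -fopenmp sketch.c -lnuma
 */
#include <numa.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* elements per thread, illustrative size */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    int nodes = numa_num_configured_nodes();

    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int node = tid % nodes;              /* naive thread-to-node mapping */

        /* Pin the thread to its node and allocate its chunk there, so the
         * accesses below stay local (the affinity "hint" done by hand). */
        numa_run_on_node(node);
        double *chunk = numa_alloc_onnode(CHUNK * sizeof(double), node);

        for (long i = 0; i < CHUNK; i++)
            chunk[i] = tid + i * 1e-9;       /* local initialization */

        numa_free(chunk, CHUNK * sizeof(double));
    }
    return 0;
}
```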

    Scheduler-activated dynamic page migration for multiprogrammed DSM multiprocessors

    No full text
    The performance of multiprogrammed shared-memory multiprocessors often suffers from scheduler interventions that neglect data locality. On cache-coherent distributed shared-memory (DSM) multiprocessors, such scheduler interventions tend to increase the rate of remote memory accesses. This paper presents a novel dynamic page migration algorithm that remedies this problem in iterative parallel programs. The purpose of the algorithm is the early detection of pages that will most likely be accessed remotely by the threads associated with them via a thread-to-memory affinity relation. The key mechanism that enables timely identification of these pages is a communication interface between the page migration engine and the operating system scheduler. The algorithm improves on previously proposed competitive page migration algorithms in many respects, including accuracy, timeliness, and cost amortization. Most notably, the algorithm is not biased by obsolete memory access history that may accumulate in the page access counters at runtime. Experiments on the SGI Origin2000 show that the algorithm far outperforms the best static page placement algorithm and a customized page migration engine implemented in IRIX, the Origin2000 operating system. The algorithm is implemented at user level and its functionality is orthogonal to the scheduling policy of the operating system.
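
    The abstract's central mechanism is moving pages toward the node where their owning thread now runs. The sketch below shows only that low-level mechanism on Linux (move_pages via libnuma), not the paper's scheduler-activated detection algorithm or the IRIX engine; migrate_buffer and the "scheduler notification" that would call it are hypothetical.

```c
/* Minimal sketch of the migration mechanism only (Linux move_pages via
 * libnuma), not the paper's scheduler-activated algorithm: move every page
 * of a buffer to the node where its owning thread was just rescheduled.
 * Compile with:  gcc migrate.c -lnuma
 */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical hook: called when the scheduler moves the owning thread to
 * target_node, so the buffer's pages should follow it. */
static int migrate_buffer(void *buf, size_t bytes, int target_node)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned long count = (bytes + page - 1) / page;

    void **pages  = malloc(count * sizeof(void *));
    int   *nodes  = malloc(count * sizeof(int));
    int   *status = malloc(count * sizeof(int));

    for (unsigned long i = 0; i < count; i++) {
        pages[i] = (char *)buf + i * page;   /* one entry per page of buf */
        nodes[i] = target_node;
    }

    /* pid 0 means "the calling process". */
    int rc = numa_move_pages(0, count, pages, nodes, status, MPOL_MF_MOVE);

    free(pages); free(nodes); free(status);
    return rc;
}

int main(void)
{
    if (numa_available() < 0) return 1;

    size_t bytes = 64 * 4096;
    double *buf = numa_alloc_onnode(bytes, 0);   /* start on node 0 */
    for (size_t i = 0; i < bytes / sizeof(double); i++) buf[i] = i;

    /* Pretend the scheduler just moved the owning thread to node 1. */
    if (numa_num_configured_nodes() > 1)
        printf("move_pages rc = %d\n", migrate_buffer(buf, bytes, 1));

    numa_free(buf, bytes);
    return 0;
}
```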

    Scaling Up Concurrent Analytical Workloads on Multi-Core Servers

    Get PDF
    Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation.

    In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and by incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduce reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads.

    Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that there is no single task scheduling and data placement strategy that is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets incurs overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time.

    Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for supporting faster processing of big-data analytical applications, fully exploiting the hardware resources of modern multi-core servers, and providing a more responsive user experience.
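
    One concrete consequence of the findings above is a dispatch rule that prefers socket-local work and steals across sockets only for tasks that are cheap to move. The sketch below is a minimal, single-threaded illustration of that rule under assumed names (task_t, next_task); it is not the thesis prototype or the commercial engine's scheduler.

```c
/* Minimal sketch (assumed names, not the thesis prototype) of a NUMA-aware
 * dispatch rule: a worker prefers its own socket's queue and steals from
 * remote sockets only for tasks not marked memory-intensive, reflecting the
 * finding that inter-socket stealing of memory-bound tasks can hurt.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct task {
    void (*run)(void *arg);
    void *arg;
    bool  memory_intensive;   /* set by whoever created the task */
    struct task *next;
} task_t;

typedef struct {
    task_t *head;             /* single-linked FIFO per socket; a real engine
                                 would use a concurrent deque */
} socket_queue_t;

static task_t *pop(socket_queue_t *q)
{
    task_t *t = q->head;
    if (t) q->head = t->next;
    return t;
}

/* Pick the next task for a worker running on socket `my_socket`. */
task_t *next_task(socket_queue_t *queues, int num_sockets, int my_socket)
{
    task_t *t = pop(&queues[my_socket]);      /* 1. local work first */
    if (t) return t;

    for (int s = 0; s < num_sockets; s++) {   /* 2. steal only cheap-to-move work */
        if (s == my_socket) continue;
        task_t *cand = queues[s].head;
        if (cand && !cand->memory_intensive)
            return pop(&queues[s]);
    }
    return NULL;                              /* 3. otherwise idle / back off */
}
```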