25 research outputs found

    Bottleneck identification and scheduling in multithreaded applications

    Get PDF
    Abstract Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. This paper proposes Bottleneck Identification and Scheduling (BIS), a cooperative software-hardware mechanism to identify and accelerate the most critical bottlenecks. BIS identifies which bottlenecks are likely to reduce performance by measuring the number of cycles threads have to wait for each bottleneck, and accelerates those bottlenecks using one or more fast cores on an Asymmetric Chip MultiProcessor (ACMP). Unlike previous work that targets specific bottlenecks, BIS can identify and accelerate bottlenecks regardless of their type. We compare BIS to four previous approaches and show that it outperforms the best of them by 15% on average. BIS' performance improvement increases as the number of cores and the number of fast cores in the system increase

    POSTER: Exploiting asymmetric multi-core processors with flexible system sofware

    Get PDF
    Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is an approach towards energy savings while keeping high performance. However, it is not known yet whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions on either the Operating System (OS) or the runtime level. Comparing these approaches shows us that the most efficient scheduling takes place in the runtime level, influencing the future research towards such solutions.This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement number 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft

    Proactive bottleneck performance analysis in parallel computing using openMP

    Full text link
    The aim of parallel computing is to increase an application performance by executing the application on multiple processors. OpenMP is an API that supports multi platform shared memory programming model and shared-memory programs are typically executed by multiple threads. The use of multi threading can enhance the performance of application but its excessive use can degrade the performance. This paper describes a novel approach to avoid bottlenecks in application and provide some techniques to improve performance in OpenMP application. This paper analyzes bottleneck performance as bottleneck inhibits performance. Performance of multi threaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and so on. This paper provides some tips how to avoid performance bottleneck problems. This paper focuses on how to reduce overheads and overall execution time to get better performance of application.Comment: 8 Pages,6 figur

    A study of recent contributions on performance and simulation techniques for accelerator devices

    Get PDF
    High performance computing platform is moving from homogeneous individual unites to heterogeneous systems where each unit is a combination of homogeneous cores and accelerator devices. Accelerators such as GPUs, FPGAs, DSPs, are usually designed for specific and intensive type of computing tasks. The presence of these devises have created fresh and attractive development platforms for developers and designers as well as novel performance analysis frameworks and optimization tools. This is the cutting edge in performance of some accelerator devices like: GPUs and Intel’s Xeon Phi. We outline some of the existing heterogeneous systems and their development frameworks. The core of this study is a review of performance modeling of these devices

    Fairness-aware scheduling on single-ISA heterogeneous multi-cores

    Get PDF
    Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has focused on improving system throughput through scheduling. However, none of the prior work has looked into fairness. Yet, guaranteeing that all threads make equal progress on heterogeneous multi-cores is of utmost importance for both multi-threaded and multi-program workloads to improve performance and quality-of-service. Furthermore, modern operating systems affinitize workloads to cores (pinned scheduling) which dramatically affects fairness on heterogeneous multi-cores. In this paper, we propose fairness-aware scheduling for single-ISA heterogeneous multi-cores, and explore two flavors for doing so. Equal-time scheduling runs each thread or workload on each core type for an equal fraction of the time, whereas equal-progress scheduling strives at getting equal amounts of work done on each core type. Our experimental results demonstrate an average 14% (and up to 25%) performance improvement over pinned scheduling through fairness-aware scheduling for homogeneous multi-threaded workloads; equal-progress scheduling improves performance by 32% on average for heterogeneous multi-threaded workloads. Further, we report dramatic improvements in fairness over prior scheduling proposals for multi-program workloads, while achieving system throughput comparable to throughput-optimized scheduling, and an average 21% improvement in throughput over pinned scheduling

    Coz: Finding Code that Counts with Causal Profiling

    Full text link
    Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.Comment: Published at SOSP 2015 (Best Paper Award

    GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications

    Full text link
    We present a parallel profiling tool, GAPP, that identifies serialization bottlenecks in parallel Linux applications arising from load imbalance or contention for shared resources . It works by tracing kernel context switch events using kernel probes managed by the extended Berkeley Packet Filter (eBPF) framework. The overhead is thus extremely low (an average 4% run time overhead for the applications explored), the tool requires no program instrumentation and works for a variety of serialization bottlenecks. We evaluate GAPP using the Parsec3.0 benchmark suite and two large open-source projects: MySQL and Nektar++ (a spectral/hp element framework). We show that GAPP is able to reveal a wide range of bottleneck-related performance issues, for example arising from synchronization primitives, busy-wait loops, memory operations, thread imbalance and resource contention.Comment: 8 page
    corecore