
    A Fast Causal Profiler for Task Parallel Programs

    This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation, and its profile execution runs in parallel on multicore machines. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized, even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty-three applications that use the Intel Threading Building Blocks library, and we have used it to design parallelization techniques in five applications that increase parallelism by an order of magnitude. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.
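
    The causal "what-if" estimate can be understood through the work/span model: parallelism is total work divided by critical-path length (span), so optimizing a region changes both quantities. Below is a minimal sketch of such an estimate under a deliberately simplified model (it assumes the region's span contribution lies entirely on the critical path); it is illustrative only, not TASKPROF's implementation.

```cpp
#include <cstdio>

// Toy "what-if" estimate in the work/span model.
// work / span: totals for the whole program.
// region_work / region_span: the region's contribution to each.
// factor: hypothetical speedup applied to the region.
double estimated_parallelism(double work, double span,
                             double region_work, double region_span,
                             double factor) {
    double new_work = work - region_work + region_work / factor;
    double new_span = span - region_span + region_span / factor;
    return new_work / new_span;  // parallelism = work / span
}

int main() {
    // Example: a region with 40% of the work but 80% of the span.
    double before = 100.0 / 50.0;  // baseline parallelism: 2.0
    double after = estimated_parallelism(100, 50, 40, 40, 8.0);
    std::printf("parallelism %.2f -> %.2f\n", before, after);  // 2.00 -> 4.33
    return 0;
}
```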

    Analysis and optimization of task granularity on the Java virtual machine

    Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, leading to missed parallelization opportunities. We focus on task-parallel applications running in a single Java Virtual Machine on a shared-memory multicore. Although their performance may depend considerably on the granularity of their tasks, this topic has received little attention in the literature. Our work fills this gap, analyzing and optimizing the task granularity of such applications. In this dissertation, we present a new methodology to accurately and efficiently collect the granularity of each executed task, implemented in a novel profiler. Our profiler collects carefully selected metrics from the whole system stack with low overhead. Our tool helps developers locate performance and scalability problems, and identifies classes and methods where optimizations related to task granularity are needed, guiding developers towards useful optimizations. Moreover, we introduce a novel technique to drastically reduce the overhead of task-granularity profiling by reifying the class hierarchy of the target application within a separate instrumentation process. Our approach allows the instrumentation process to instrument only the classes representing tasks, inserting more efficient instrumentation code that decreases the overhead of task detection. Our technique significantly speeds up task-granularity profiling and thus enables the collection of accurate metrics with low overhead. We use our novel techniques to analyze task granularity in the DaCapo, ScalaBench, and Spark Perf benchmark suites, revealing inefficiencies related to fine-grained and coarse-grained tasks in several workloads. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in numerous benchmarks, performing optimizations in the classes and methods indicated by our tool. Our optimizations result in significant speedups (up to 5.90x) in numerous workloads suffering from fine- and coarse-grained tasks in different environments. Our results highlight the importance of analyzing and optimizing task granularity on the Java Virtual Machine.
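
    The fine-versus-coarse trade-off described above can be captured with a simple cost model: per-task overhead penalizes fine granularity, while poor load balance penalizes coarse granularity. The sketch below uses made-up parameters purely for illustration; it is not the dissertation's profiler or its metrics.

```cpp
#include <algorithm>
#include <cstdio>

// Toy model: n work items split into tasks of 'grain' items each,
// run on 'cores' cores with a fixed per-task scheduling overhead.
double estimated_runtime(double n, double grain, double cores,
                         double item_cost, double task_overhead) {
    double tasks = n / grain;
    double compute = n * item_cost / std::min(cores, tasks);  // cores idle if too few tasks
    double overhead = tasks * task_overhead;                  // grows as tasks get finer
    return compute + overhead;
}

int main() {
    // 1M items on 8 cores: both extremes of granularity lose.
    for (double grain : {1.0, 100.0, 10000.0, 1000000.0})
        std::printf("grain %8.0f -> %12.0f time units\n", grain,
                    estimated_runtime(1e6, grain, 8, 1.0, 50.0));
    return 0;
}
```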

    Parallel Real-Time Scheduling for Latency-Critical Applications

    In order to provide safety guarantees or quality-of-service guarantees, many of today's systems consist of latency-critical applications, i.e., applications with timing constraints. The problem of scheduling multiple latency-critical jobs on a multiprocessor or multicore machine has been extensively studied for sequential (non-parallelizable) jobs, and different system models and objectives have been considered. However, the computational requirement of a single job is still limited by the capacity of a single core. To provide increasingly complex application functionality and to complete higher computational demands within the same or even more stringent timing constraints, we must exploit the internal parallelism of jobs, where individual jobs are parallel programs that can potentially utilize more than one core in parallel. However, there is little work considering scheduling multiple parallel jobs that are latency-critical. This dissertation focuses on developing new scheduling strategies, analysis tools, and practical platform design techniques to enable efficient and scalable parallel real-time scheduling for latency-critical applications on multicore systems. In particular, the research focuses on two types of systems: (1) static real-time systems for tasks with deadlines, where the temporal properties of the tasks that need to execute are known a priori and the goal is to guarantee the temporal correctness of the tasks prior to their execution; and (2) online systems for latency-critical jobs, where multiple jobs arrive over time and the goal is to optimize a performance objective of the jobs during execution. For static real-time systems with parallel tasks, several scheduling strategies, including global earliest deadline first, global rate monotonic, and a novel federated scheduling, are proposed, analyzed, and implemented. These scheduling strategies have the best known theoretical performance for parallel real-time tasks under any global strategy, any fixed-priority scheduling, and any scheduling strategy, respectively. In addition, federated scheduling is generalized to systems with multiple criticality levels and systems with stochastic tasks. Both numerical and empirical experiments show that federated scheduling and its variations have good schedulability performance and are efficient in practice. For online systems with multiple latency-critical jobs, different online scheduling strategies are proposed and analyzed for different objectives, including maximizing the number of jobs meeting a target latency, maximizing the profit of jobs, minimizing the maximum latency, and minimizing the average latency. For example, a simple first-in-first-out scheduler is proven to be scalable for minimizing the maximum latency. Based on this theoretical intuition, a more practical work-stealing scheduler is developed, analyzed, and implemented. Empirical evaluations indicate that, on both real-world and synthetic workloads, this work-stealing implementation performs almost as well as an optimal scheduler.
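
    To make federated scheduling concrete: the classic analysis assigns each high-utilization task its own dedicated cores, computed from its total work C, its span (critical-path length) L, and its relative deadline D as n = ceil((C - L) / (D - L)); a greedy scheduler on n cores then finishes within L + (C - L)/n <= D. A minimal sketch of that allocation, written from the published formula rather than taken from the dissertation's code:

```cpp
#include <cmath>
#include <cstdio>

// Dedicated cores for a high-utilization task under federated
// scheduling: n = ceil((C - L) / (D - L)), where C = total work,
// L = span (critical-path length), D = relative deadline.
// Requires L < D; otherwise no core count can meet the deadline.
int federated_cores(double work, double span, double deadline) {
    if (span >= deadline) return -1;  // infeasible task
    return static_cast<int>(std::ceil((work - span) / (deadline - span)));
}

int main() {
    // Example: C = 100 ms of work, L = 10 ms of span, D = 25 ms.
    std::printf("cores needed: %d\n", federated_cores(100, 10, 25));  // ceil(90/15) = 6
    return 0;
}
```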

    Toward Efficient Scheduling for Parallel Real-Time Tasks on Multiprocessors

    Modern real-time applications are becoming more demanding computationally, while their temporal requirements, dictated by the physical world, often remain unchanged. This, coupled with the increasing prevalence of multiprocessors in real-time systems, necessitates that highly computation-demanding real-time tasks be parallelized to exploit the parallelism offered by the underlying hardware, in order to satisfy their temporal constraints. Scheduling parallel real-time tasks, however, introduces a new layer of complexity due to the allowance for intra-task parallelism. This dissertation addresses the problem of scheduling parallel real-time tasks in which tasks may (or may not) access shared non-processor resources, such as in-memory buffers or data structures. Specifically, for independent tasks, we propose new scheduling algorithms and schedulability analyses for parallel tasks with these characteristics, under federated and global scheduling. Experimental results show that the proposed algorithms and analyses improve upon previously introduced methods. For parallel tasks that may access shared non-processor resources, we present a blocking analysis for two different types of spinlocks; through evaluations, we make a recommendation for a preferable ordering of locks. We also study practical runtime parallel scheduler designs for soft real-time applications and present a design that is more suitable for soft real-time systems.
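
    The abstract does not name the two spinlock types compared; in real-time blocking analyses, a common contrast is FIFO-ordered versus unordered spinning. Purely as an illustration of the FIFO case (a standard ticket lock, not code from the dissertation):

```cpp
#include <atomic>

// FIFO (ticket) spinlock: threads acquire the lock in arrival order,
// which makes worst-case blocking analyzable: with m cores spinning,
// a request waits for at most m - 1 earlier critical sections.
class TicketLock {
    std::atomic<unsigned> next_{0};     // next ticket to hand out
    std::atomic<unsigned> serving_{0};  // ticket currently allowed in
public:
    void lock() {
        unsigned my = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != my) {
            // spin until our ticket is called
        }
    }
    void unlock() {
        serving_.store(serving_.load(std::memory_order_relaxed) + 1,
                       std::memory_order_release);
    }
};
```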

    MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

    Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code-crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent: an optimization preferred on one machine may not be suitable for another. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at generating new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++ which combines several models of concurrency, including fork-join, SIMD, and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features that make MetaFork unique and novel: (1) performing automatic code translation between concurrency platforms targeting multicore architectures; (2) providing a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm, and pipelining parallelism; (3) generating parallel code from serial code, with an emphasis on code depending on machine or program parameters (e.g., cache size, number of processors, number of threads per thread block); and (4) optimizing code depending on parameters that are unknown at compile-time.
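
    To make the fork-join model concrete: fork-join code spawns a child computation to run concurrently with its parent and later joins it. The sketch below shows what a translation onto standard C++ tasks could look like; the MetaFork-like pseudocode in the comment and the translation are illustrative, not output of the actual MetaFork compiler.

```cpp
#include <cstdio>
#include <future>

int work(int x) { return x * x; }  // placeholder child computation

int main() {
    // Fork-join in a MetaFork-like source model:
    //   meta_fork a = work(3);  // fork: child runs concurrently with parent
    //   int b = work(4);        // parent continues
    //   meta_join;              // join: wait for all forked children
    //
    // One possible translation onto standard C++ tasks:
    std::future<int> a = std::async(std::launch::async, work, 3);  // fork
    int b = work(4);                                               // parent continuation
    int result = a.get() + b;                                      // join
    std::printf("%d\n", result);  // prints 25
    return 0;
}
```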

    Easier Parallel Programming with Provably-Efficient Runtime Schedulers

    Over the past decade, processor manufacturers have pivoted from increasing uniprocessor performance to multicore architectures. However, utilizing this computational power has proved challenging for software developers. Many concurrency platforms and languages have emerged to address parallel programming challenges, yet writing correct and performant parallel code retains a reputation as one of the hardest tasks a programmer can undertake. This dissertation studies how runtime scheduling systems can be used to make parallel programming easier. We address the difficulty of writing parallel data structures, automatically finding shared-memory bugs, and reproducing non-deterministic synchronization bugs. Each of the systems presented depends on a novel runtime system which provides strong theoretical performance guarantees and performs well in practice.

    Analysis of Performance Differences among Task-Parallel Runtime Systems Based on Scheduling Delays

    Degree type: Doctoral degree (course-based). Dissertation committee: (Chair) Associate Professor Masashi Toyoda (University of Tokyo); Professor Kenjiro Taura (University of Tokyo); Associate Professor Hidetsugu Irie (University of Tokyo); Professor Kengo Nakajima (University of Tokyo); Team Leader Mitsuhisa Sato (RIKEN); Associate Professor Rio Yokota (Tokyo Institute of Technology). University of Tokyo (東京大学).

    The Cilkprof Scalability Profiler

    Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the overall work and span. Profiling work and span in this way enables a programmer to quickly diagnose scalability bottlenecks in a Cilk program. Despite the detail and quantity of information required to collect these measurements, Cilkprof runs with only constant asymptotic slowdown over the serial running time of the parallel computation. As an example of Cilkprof's usefulness, we used Cilkprof to diagnose a scalability bottleneck in an 1800-line parallel breadth-first search (PBFS) code. By examining Cilkprof's output in tandem with the source code, we were able to zero in on a call site within the PBFS routine that imposed a scalability bottleneck. A minor code modification then improved the parallelism of PBFS by a factor of 5. Using Cilkprof, it took us less than two hours to find and fix a scalability bug which had, until then, eluded us for months. This paper describes the Cilkprof algorithm and proves theoretically, using an amortization argument, that Cilkprof incurs only constant overhead compared with the application's native serial running time. Cilkprof was implemented by compiler instrumentation, that is, by modifying the LLVM compiler to insert instrumentation into user programs. On a suite of 16 application benchmarks, Cilkprof incurs a geometric-mean multiplicative overhead of only 1.9 and a maximum multiplicative overhead of only 7.4 compared with running the benchmarks without instrumentation.
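
    The work and span quantities Cilkprof collects compose by simple rules: across serial composition both add, while across parallel composition work adds but span takes the maximum over the parallel branches. A minimal sketch of this bookkeeping on a fork-join recursion, charging one unit of cost per call (plain C++ for illustration, not Cilkprof's compiler-inserted instrumentation):

```cpp
#include <algorithm>
#include <cstdio>

struct Profile { double work, span; };  // serial time, critical-path length

// Work/span of a fork-join fib(n), one unit of cost per call:
//   work(n) = 1 + work(n-1) + work(n-2)      (everything executes)
//   span(n) = 1 + max(span(n-1), span(n-2))  (branches run in parallel)
Profile fib_profile(int n) {
    if (n < 2) return {1, 1};
    Profile a = fib_profile(n - 1);  // spawned branch
    Profile b = fib_profile(n - 2);  // continuation branch
    return {1 + a.work + b.work, 1 + std::max(a.span, b.span)};
}

int main() {
    Profile p = fib_profile(20);
    std::printf("work=%.0f span=%.0f parallelism=%.1f\n",
                p.work, p.span, p.work / p.span);
    return 0;
}
```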

    Cilkprof : a scalability profiler for Cilk programs

    Thesis: M.Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014. Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 51-53). This thesis describes the design and implementation of Cilkprof, a profiling tool that helps programmers diagnose scalability problems in their Cilk programs. Cilkprof provides in-depth information about the scalability of programs without adding excessive overhead, and its output can be used to find scalability bottlenecks in the user's code. Cilkprof makes profiling measurements at the fork and join points of computations, which typically limits the amount of overhead incurred by the profiler. In addition, despite recording in-depth information, Cilkprof does not generate the large log files that are typical of trace-based profilers. The profiling algorithm incurs only constant amortized overhead per measurement. Cilkprof slows down serial program execution by a factor of about 10 in the common case, on a well-coarsened parallel program; this slowdown is reasonable for the amount of information gained from the profiling. Finally, the approach taken by Cilkprof enables the creation of an API that allows users to specify their own profiling code without having to change the Cilk runtime. By Kerry Xing.
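
    The user-extensible profiling API mentioned above is not specified in the abstract; one plausible shape is a pair of callbacks that the runtime invokes at fork and join points, which is where Cilkprof takes its measurements. The interface below is hypothetical, invented purely for illustration.

```cpp
#include <cstdio>

// Hypothetical hook interface: the runtime would call out to
// user-supplied profiling code at each fork and join point.
struct ProfilerHooks {
    virtual void on_fork(const char* site) = 0;  // a task is spawned here
    virtual void on_join(const char* site) = 0;  // spawned tasks synchronize
    virtual ~ProfilerHooks() = default;
};

// Example user plug-in: count fork/join events per run.
struct CountingProfiler : ProfilerHooks {
    long forks = 0, joins = 0;
    void on_fork(const char*) override { ++forks; }
    void on_join(const char*) override { ++joins; }
};

int main() {
    CountingProfiler p;
    p.on_fork("fib.cpp:12");  // the runtime would invoke these hooks
    p.on_join("fib.cpp:14");  // at each spawn and sync it executes
    std::printf("forks=%ld joins=%ld\n", p.forks, p.joins);
    return 0;
}
```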