9 research outputs found

    On-the-fly pipeline parallelism

    Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with $T_1$ work and $T_\infty$ span (critical-path length), Piper executes the computation on $P$ processors in $T_P \le T_1/P + O(T_\infty + \lg P)$ expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.
    Funding: National Science Foundation (U.S.) (Grant CNS-1017058); National Science Foundation (U.S.) (Grant CCF-1162148); National Science Foundation (U.S.). Graduate Research Fellowship
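
    The work and span quantities in the bound above can be made concrete with a small sketch. It is not the Piper scheduler or Cilk-P code. It assumes a simplified dependency model in which stage j of iteration i waits for stage j-1 of the same iteration and, when it exists, stage j of the previous iteration. The function names, the cost table, and the constant c standing in for the hidden O(...) constant are all hypothetical.

```python
import math


def pipeline_work_span(costs):
    """Return (work, span) for a pipeline dag under an assumed model.

    costs[i][j] is the cost of stage j in iteration i; rows may have
    different lengths, mirroring a stage count that varies per iteration.
    Assumed dependencies: node (i, j) waits for (i, j-1) and, if that
    node exists, for (i-1, j).
    """
    work = 0
    finish = []  # finish[i][j] = earliest completion time of node (i, j)
    for i, row in enumerate(costs):
        finish.append([])
        for j, cost in enumerate(row):
            work += cost
            ready = finish[i][j - 1] if j > 0 else 0
            if i > 0 and j < len(costs[i - 1]):
                ready = max(ready, finish[i - 1][j])
            finish[i].append(ready + cost)
    span = max((f for row in finish for f in row), default=0)
    return work, span


def expected_time_bound(work, span, p, c=1.0):
    """Evaluate T_1/P + c*(T_inf + lg P); c is an assumed constant."""
    return work / p + c * (span + math.log2(p))


work, span = pipeline_work_span([[1] * 3 for _ in range(4)])
print(work, span, expected_time_bound(work, span, p=4))
```

    For 4 iterations of 3 unit-cost stages, this model gives work 12 and span 6, so the parallelism (work divided by span) is 2 no matter how many processors are available.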

    Finding a Hamiltonian Path in a Cube with Specified Turns is Hard

    We prove the NP-completeness of finding a Hamiltonian path in an N × N × N cube graph with turns exactly at specified lengths along the path. This result establishes NP-completeness of Snake Cube puzzles: folding a chain of $N^3$ unit cubes, joined at face centers (usually by a cord passing through all the cubes), into an N × N × N cube. Along the way, we prove a universality result that zig-zag chains (which must turn every unit) can fold into any polycube after 4 × 4 × 4 refinement, or into any Hamiltonian polycube after 2 × 2 × 2 refinement.
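
    To make the puzzle statement concrete, here is a brute-force backtracking sketch. It is not taken from the paper. The encoding of the chain as a string of 'S' (straight) and 'T' (turn) joints and the name fold_snake are assumptions made for illustration. The search is exponential, which is consistent with the hardness result, so it is only practical for very small N.

```python
from itertools import product

# The six axis-aligned unit steps in the cube graph.
DIRS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def fold_snake(n, joints):
    """Search for a folding of an n**3-cube chain into an n x n x n cube.

    joints[i] describes interior cube i+1 of the chain: 'S' means the
    chain passes straight through it, 'T' means it turns there.
    Returns the list of visited cells, or None if no folding exists.
    """
    total = n ** 3
    if len(joints) != total - 2:
        raise ValueError("need exactly one joint per interior cube")

    def extend(path, used, last_dir):
        if len(path) == total:
            return list(path)
        for d in DIRS:
            if last_dir is not None:
                go_straight = joints[len(path) - 2] == "S"
                if (d == last_dir) != go_straight:
                    continue
            nxt = tuple(a + b for a, b in zip(path[-1], d))
            # Reversing direction lands on the previous cell, which is
            # already used, so it is rejected here as well.
            if nxt in used or not all(0 <= c < n for c in nxt):
                continue
            path.append(nxt)
            used.add(nxt)
            result = extend(path, used, d)
            if result is not None:
                return result
            path.pop()
            used.remove(nxt)
        return None

    for start in product(range(n), repeat=3):
        result = extend([start], {start}, None)
        if result is not None:
            return result
    return None


print(fold_snake(2, "TTTTTT"))   # a zig-zag chain of 8 cubes
print(fold_snake(2, "STTTTT"))   # a straight joint cannot fit in a 2-wide cube
```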

    Folding equilateral plane graphs

    22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5-8, 2011. Proceedings.
    We consider two types of folding applied to equilateral plane graph linkages. First, under continuous folding motions, we show how to reconfigure any linear equilateral tree (lying on a line) into a canonical configuration. By contrast, such reconfiguration is known to be impossible for linear (nonequilateral) trees and for (nonlinear) equilateral trees. Second, under instantaneous folding motions, we show that an equilateral plane graph has a noncrossing linear folded state if and only if it is bipartite. Not only is the equilateral constraint necessary for this result, but we show that it is strongly NP-complete to decide whether a (nonequilateral) plane graph has a linear folded state. Equivalently, we show strong NP-completeness of deciding whether an abstract metric polyhedral complex with one central vertex has a noncrossing flat folded state with a specified “outside region”. By contrast, the analogous problem for a polyhedral manifold with one central vertex (single-vertex origami) is only weakly NP-complete.
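
    By the characterization stated in the abstract, deciding whether an equilateral plane graph has a noncrossing linear folded state comes down to a bipartiteness test. The sketch below shows a standard breadth-first 2-coloring. It assumes the graph is given as an adjacency dictionary with every vertex as a key (a hypothetical input format), and it checks only the combinatorial condition. It does not construct the folded state.

```python
from collections import deque


def is_bipartite(adj):
    """Return True if the undirected graph adj (vertex -> neighbors) is 2-colorable."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # an odd cycle was found
    return True


square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(is_bipartite(square), is_bipartite(triangle))
```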

    Developing a science of fast code for the post-Moore era

    Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 303-328).
    The end of Moore's Law, which experts predict to occur in as few as 5 years, means that even average programmers will need to be able to write fast code. Software performance engineering offers great promise to provide computer performance gains in the post-Moore era, but developing efficient software today requires substantial expertise and arcane knowledge of hardware and software systems. Multicore processors are particularly challenging to use efficiently, because doing so requires programmers to engage in parallel programming and to deal with nondeterministic program behavior and parallel scalability concerns. I contend that we can remedy the ad hoc and unprincipled nature of software performance engineering by creating simple and integrated programming technologies for writing fast code. This thesis studies how such technologies can be built by examining nine artifacts that enable principled approaches to tackling nondeterminism and scalability concerns in writing efficient multicore software.
    Five artifacts develop programming models and theories of performance for writing multicore programs that are efficient both in theory and in practice:
    - PBFS, a work-efficient parallel breadth-first search algorithm.
    - The Prism chromatic-scheduling algorithm, which executes dynamic data-graph computations deterministically in parallel.
    - Ordering heuristics for parallel greedy graph coloring algorithms.
    - The pedigree mechanism and DotMix algorithm for generating pseudorandom numbers deterministically in parallel in dynamic multithreaded programs.
    - The Cilk-P concurrency platform, which provides linguistic and runtime support for deterministic on-the-fly pipeline parallelism.
    Three artifacts strive to embed abstract programming and performance models into tools and compilers:
    - Cilkprof, a profiler that efficiently measures how each call site in a Cilk program contributes to the program's scalability.
    - Rader, a provably good race detector for Cilk programs that use reducer hyperobjects.
    - The Tapir compiler intermediate representation, which enables existing compiler optimizations for serial code to optimize across parallel control flow with minimal changes.
    The final artifact tackles the complexity of creating efficient diagnostic tools:
    - CSI, a framework that provides comprehensive static instrumentation for efficient dynamic-analysis tools.
    Together, these artifacts contribute to developing a more coherent science of fast code for multicores than exists today.
    By Tao Benjamin Schardl. Ph. D.

    Design and analysis of a nondeterministic PBFS algorithm

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 75-77).
    I have developed a multithreaded implementation of breadth-first search (BFS) of a sparse graph using the Cilk++ extensions to C++. My PBFS program on a single processor runs as quickly as a standard C++ breadth-first search implementation. PBFS achieves high work-efficiency by using a novel implementation of a multiset data structure, called a "bag," in place of the FIFO queue usually employed in serial breadth-first search algorithms. For a variety of benchmark input graphs whose diameters are significantly smaller than the number of vertices (a condition met by many real-world graphs), PBFS demonstrates good speedup with the number of processing cores. Since PBFS employs a nonconstant-time "reducer" (a "hyperobject" feature of Cilk++), the work inherent in a PBFS execution depends nondeterministically on how the underlying work-stealing scheduler load-balances the computation. I provide a general method for analyzing nondeterministic programs that use reducers. PBFS also is nondeterministic in that it contains benign races which affect its performance but not its correctness. Fixing these races with mutual-exclusion locks slows down PBFS empirically, but it makes the algorithm amenable to analysis. In particular, I show that for a graph $G = (V, E)$ with diameter $D$ and bounded out-degree, this data-race-free version of the PBFS algorithm runs in time $O((V + E)/P + D \lg^3(V/D))$ on $P$ processors, which means that it attains near-perfect linear speedup if $P < (V + E)/(D \lg^3(V/D))$.
    By Tao Benjamin Schardl. M.Eng.
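
    The idea of replacing the FIFO queue with an unordered collection can be illustrated with a serial, layer-by-layer BFS. This is a sketch only. A Python set stands in for the thesis's "bag" data structure, no parallelism or reducers are involved, and the function name and input format are assumptions.

```python
def layered_bfs(adj, source):
    """Compute BFS distances one layer at a time using unordered frontiers.

    adj maps each vertex to an iterable of its out-neighbors. The order in
    which a frontier is traversed does not affect the distances, which is
    what makes an unordered multiset a workable substitute for a queue.
    """
    dist = {source: 0}
    frontier = {source}  # unordered stand-in for a bag
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        for u in frontier:  # a parallel version would split this loop
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = depth
                    next_frontier.add(v)
        frontier = next_frontier
    return dist


graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(layered_bfs(graph, 0))
```

    The number of iterations of the outer loop is bounded by the diameter plus one, which is why the diameter D appears in the span term of the running-time bound.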

    Upper Bounds on Number of Steals in Rooted Trees

    Inspired by applications in parallel computing, we analyze the setting of work stealing in multithreaded computations. We obtain tight upper bounds on the number of steals when the computation can be modeled by rooted trees. In particular, we show that if the computation with n processors starts with one processor having a complete k-ary tree of height h (and the remaining n−1 processors having nothing), the maximum possible number of steals is $\sum_{i=1}^{n} (k-1)^i \binom{h}{i}$.
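
    The closed form in the abstract is easy to evaluate directly. The helper below is a small sketch of that arithmetic, reading the formula as a sum over i from 1 to n of (k-1)^i times the binomial coefficient "h choose i". The function name is hypothetical.

```python
from math import comb


def max_steals(n, k, h):
    """Evaluate sum_{i=1}^{n} (k-1)^i * C(h, i) for n processors,
    a complete k-ary tree, and height h."""
    return sum((k - 1) ** i * comb(h, i) for i in range(1, n + 1))


print(max_steals(n=3, k=2, h=3))
print(max_steals(n=2, k=3, h=2))
```

    When n is at least h, the binomial theorem collapses the sum to k^h - 1, since terms with i greater than h vanish.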

    On the efficiency of localized work stealing

    This paper investigates a variant of the work-stealing algorithm that we call the localized work-stealing algorithm. The intuition behind this variant is that because of locality, processors can benefit from working on their own work. Consequently, when a processor is free, it makes a steal attempt to get back its own work. We call this type of steal a steal-back. We show that the expected running time of the algorithm is $T_1/P + O(T_\infty P)$, and that under the “even distribution of free agents assumption”, the expected running time of the algorithm is $T_1/P + O(T_\infty \lg P)$. In addition, we obtain another running-time bound based on ratios between the sizes of serial tasks in the computation. If $M$ denotes the maximum ratio between the largest and the smallest serial tasks of a processor after removing a total of $O(P)$ serial tasks across all processors from consideration, then the expected running time of the algorithm is $T_1/P + O(T_\infty M)$. Keywords: Parallel algorithms; Multithreaded computation; Work stealing; Localization

    Executing Dynamic Data-Graph Computations Deterministically Using Chromatic Scheduling

    A data-graph computation, popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex’s prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This article introduces Prism, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Prism uses a vertex coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by Prism to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze Prism using work-span analysis. Let $G = (V, E)$ be a degree-$\Delta$ graph colored with $\chi$ colors, and suppose that $Q \subseteq V$ is the set of active vertices in a round. Define $\mathrm{size}(Q) = |Q| + \sum_{v \in Q} \deg(v)$, which is proportional to the space required to store the vertices of $Q$ using a sparse-graph layout. We show that a $P$-processor execution of Prism performs updates in $Q$ using $O(\chi(\lg(Q/\chi) + \lg \Delta) + \lg P)$ span and $\Theta(\mathrm{size}(Q) + P)$ work. These theoretical guarantees are matched by good empirical performance. To isolate the effect of the scheduling algorithm on performance, we modified GraphLab to incorporate Prism and studied seven application benchmarks on a 12-core multicore machine. Prism executes the benchmarks 1.2 to 2.1 times faster than GraphLab’s nondeterministic lock-based scheduler while providing deterministic behavior. This article also presents Prism-R, a variation of Prism that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. Prism-R satisfies the same theoretical bounds as Prism, but its implementation is more involved, incorporating a multivector data structure to maintain a deterministically ordered set of vertices partitioned by color. Despite its additional complexity, Prism-R is only marginally slower than Prism. On the seven application benchmarks studied, Prism-R incurs a 7% geometric mean overhead relative to Prism.
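
    The round structure described above can be sketched serially. This is not Prism itself. A plain dictionary of lists stands in for the multibag, the coloring is a simple greedy one, and all function names and the update callback signature are assumptions made for illustration.

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color not used by a colored neighbor."""
    color = {}
    for v in adj:
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color


def active_size(adj, active):
    """size(Q) = |Q| + sum of deg(v) over v in Q, as defined in the abstract."""
    return len(active) + sum(len(adj[v]) for v in active)


def chromatic_round(adj, color, active, update):
    """Run one round and return the vertices activated for the next round.

    update(v, neighbors) applies the local update at v and returns an
    iterable of vertices to activate. Vertices of one color are pairwise
    non-adjacent, so the inner loop could run in parallel without locks;
    here it runs serially.
    """
    by_color = {}
    for v in active:
        by_color.setdefault(color[v], []).append(v)
    next_active = set()
    for c in sorted(by_color):  # colors are processed one after another
        for v in by_color[c]:
            next_active |= set(update(v, adj[v]))
    return next_active


adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
color = greedy_coloring(adj)
print(color, active_size(adj, {"a", "b"}))
```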

    The Cilkprof Scalability Profiler

    Cilkprof is a scalability profiler for multithreaded Cilk computations. Unlike its predecessor Cilkview, which analyzes only the whole-program scalability of a Cilk computation, Cilkprof collects work (serial running time) and span (critical-path length) data for each call site in the computation to assess how much each call site contributes to the overall work and span. Profiling work and span in this way enables a programmer to quickly diagnose scalability bottlenecks in a Cilk program. Despite the detail and quantity of information required to collect these measurements, Cilkprof runs with only constant asymptotic slowdown over the serial running time of the parallel computation. As an example of Cilkprof's usefulness, we used Cilkprof to diagnose a scalability bottleneck in an 1800-line parallel breadth-first search (PBFS) code. By examining Cilkprof's output in tandem with the source code, we were able to zero in on a call site within the PBFS routine that imposed a scalability bottleneck. A minor code modification then improved the parallelism of PBFS by a factor of 5. Using Cilkprof, it took us less than two hours to find and fix a scalability bug which had, until then, eluded us for months. This paper describes the Cilkprof algorithm and proves theoretically using an amortization argument that Cilkprof incurs only constant overhead compared with the application's native serial running time. Cilkprof was implemented by compiler instrumentation, that is, by modifying the LLVM compiler to insert instrumentation into user programs. On a suite of 16 application benchmarks, Cilkprof incurs a geometric-mean multiplicative overhead of only 1.9 and a maximum multiplicative overhead of only 7.4 compared with running the benchmarks without instrumentation.
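
    The two quantities Cilkprof measures, work and span, compose simply over fork-join structure: both add under serial composition, while under parallel composition work adds and span takes the maximum. The sketch below illustrates that rule on a hand-built tree. It is not the Cilkprof algorithm, and the tuple encoding of the computation is hypothetical.

```python
def work_span(node):
    """Return (work, span) of a series-parallel computation tree.

    A node is ("leaf", cost), ("series", [children]) or
    ("parallel", [children]).
    """
    kind = node[0]
    if kind == "leaf":
        return node[1], node[1]
    parts = [work_span(child) for child in node[1]]
    work = sum(w for w, _ in parts)
    if kind == "series":
        span = sum(s for _, s in parts)
    elif kind == "parallel":
        span = max((s for _, s in parts), default=0)
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    return work, span


program = ("series", [
    ("leaf", 2),
    ("parallel", [("leaf", 5), ("leaf", 3)]),
    ("leaf", 1),
])
work, span = work_span(program)
print(work, span, work / span)  # the last value is the parallelism
```

    In this toy tree the 5-unit branch dominates the span, so shortening it is the change that would raise the parallelism. A per-call-site breakdown is meant to surface exactly that kind of information.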