
11 research outputs found

On-the-fly pipeline parallelism

Author: Lee I-Ting Angelina; Leiserson Charles E.; Schardl Tao Benjamin; Sukha Jim; Zhang Zhunping
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2013

Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism. Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with $T_1$ work and $T_\infty$ span (critical-path length), Piper executes the computation on $P$ processors in $T_P \le T_1/P + O(T_\infty + \lg P)$ expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time. We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors. National Science Foundation (U.S.) (Grant CNS-1017058). National Science Foundation (U.S.) (Grant CCF-1162148). National Science Foundation (U.S.). Graduate Research Fellowshi
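The stage-to-stage handoff described in this abstract can be illustrated with a minimal Python sketch of a fixed, construct-and-run style pipeline: one thread per stage, with a queue between neighbouring stages. This is not Cilk-P and does not implement the Piper algorithm; the names `run_pipeline`, `_stage_worker`, and `_SENTINEL` are hypothetical, and the queues are unbounded, so the sketch has none of the throttling the abstract attributes to Piper.

```python
import queue
import threading

_SENTINEL = object()  # hypothetical end-of-stream marker


def _stage_worker(func, inbox, outbox):
    """Apply func to each element from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            outbox.put(_SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(stages, items):
    """Run items through a linear sequence of stage functions, one thread per stage."""
    # queues[i] feeds stage i; queues[-1] collects the final results.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=_stage_worker, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(stages)
    ]
    for w in workers:
        w.start()

    # Feed the stream; stage 0 can start on a new element while later
    # stages are still processing earlier ones.
    for item in items:
        queues[0].put(item)
    queues[0].put(_SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is _SENTINEL:
            break
        results.append(out)

    for w in workers:
        w.join()
    return results
```

Because each stage here is a single thread reading from a FIFO queue, output order matches input order. The number of stages is fixed before the run starts, which is exactly the restriction that on-the-fly pipelining is described as lifting.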

CiteSeerX

DSpace@MIT

Crossref

A lisp oriented architecture

Author: McClain John W. F. (John Wesley Ferguson)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1994

Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1994. Includes bibliographical references (p. 63-67). By John W.F. McClain. M.S.

DSpace@MIT

Shared-Environment Call-by-Need

Author: Stelle George W
Publication venue: UNM Digital Repository
Publication date: 13/05/2019

Call-by-need semantics formalizes the wisdom that work should be done at most once. It frees programmers to focus more on the correctness of their code and less on the operational details. Because of this property, programmers of lazy functional languages rely heavily on their compiler to both preserve correctness and generate high-performance code for high-level abstractions. In this dissertation I present a novel technique for compiling call-by-need semantics by using shared environments to share results of computation. I show how the approach enables a compiler that generates high-performance code, while staying simple enough to lend itself to formal reasoning. The dissertation is divided into three main contributions. First, I present an abstract machine, the CE machine, which formalizes the approach. Second, I show that it can be implemented as a native code compiler with encouraging performance results. Finally, I present a verified compiler, implemented in the Coq proof assistant, demonstrating how the simplicity of the approach enables formal verification.
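The "at most once" property of call-by-need can be sketched in Python with a memoizing thunk. This is a generic illustration only: it is not the shared-environment technique or the abstract machine of the dissertation, and the names `Thunk`, `force`, `expensive`, and `calls` are hypothetical.

```python
class Thunk:
    """A delayed computation whose result is computed at most once, then shared."""

    _UNSET = object()  # marker meaning "not evaluated yet"

    def __init__(self, compute):
        self._compute = compute
        self._value = Thunk._UNSET

    def force(self):
        """Return the value, running the computation only on the first call."""
        if self._value is Thunk._UNSET:
            self._value = self._compute()
            self._compute = None  # drop the closure once the value is cached
        return self._value


calls = []  # records how many times the computation actually runs


def expensive():
    calls.append(1)
    return 6 * 7


t = Thunk(expensive)
first = t.force()   # runs expensive()
second = t.force()  # reuses the cached result
```

Both `first` and `second` hold the same value, while `calls` shows that `expensive` ran a single time. If `force` is never called, the work is never done at all.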

Path expressions: a technique for specifying process synchronization

Author: Campbell Roy Harold
Publication venue: Newcastle University
Publication date: 01/01/1976

PhD Thesis. Path expressions are a new method of describing synchronization which provides a clear and structured approach to the description of shared data and the co-ordination and communication between concurrent processes. This method is flexible in its ability to express synchronization, and may be used in differing forms, some equivalent to P and V operations on counting semaphores. This method of synchronization is presented, and the motivations and considerations from which it is derived are explained. A method for formally characterizing path expressions is given, together with several automatic means of translating path expressions into implementations using existing synchronization operations. The Science Research Council
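As a rough illustration of translating an ordering constraint into P and V operations, the following Python sketch enforces that two operations strictly alternate (first, second, first, second, ...) using two semaphores. The class `SequencePath`, its method names, and the deposit/remove example are hypothetical, and the thesis's own notation and translation schemes may differ from this sketch.

```python
import threading


class SequencePath:
    """Enforce the order first, second, first, second, ... with two semaphores."""

    def __init__(self):
        self._may_first = threading.Semaphore(1)   # first may run initially
        self._may_second = threading.Semaphore(0)  # second must wait for first

    def first(self, action):
        self._may_first.acquire()   # P: wait for permission to run first
        result = action()
        self._may_second.release()  # V: allow one run of second
        return result

    def second(self, action):
        self._may_second.acquire()  # P: wait until first has completed
        result = action()
        self._may_first.release()   # V: allow the next run of first
        return result


log = []
path = SequencePath()

# The consumer starts first but blocks until a deposit has happened.
consumer = threading.Thread(
    target=path.second, args=(lambda: log.append("remove"),)
)
consumer.start()
path.first(lambda: log.append("deposit"))
consumer.join()
```

After the run, `log` records the deposit before the removal even though the consumer thread was started first. The synchronization lives entirely in the wrapper, so the actions themselves contain no semaphore calls.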

Newcastle University eTheses

On the comparison of protection systems

Author: Wyeth David
Publication venue: Newcastle University
Publication date: 01/01/1976

PhD Thesis. A methodology is presented for performing quantitative cost-benefit comparisons of protection systems. Protection systems in both programming languages and machine architectures can be understood and described in terms of the concept of a domain, an abstract entity which defines the access privileges of an executing program to objects in a system. Though the issues of protection and addressing can be treated separately, the realisation of the close relationship between protection and addressing can assist in the implementation of domains using addressing techniques and provides a basis for the comparison of protection systems. Current formal models of protection are seen to aid qualitative comparisons but do not provide an effective yardstick with which to compare protection systems. Based on the ideas of protection through addressing, a protection model is developed from which cost and benefit measures of protection are derived in order to achieve the quantitative comparison methodology. Two detailed examples of the application of the methodology are presented. The first concerns the protection implemented in various Algol W run-time systems, and the second compares the protection system of IBM's 370 DOS/VS operating system with a proposed alternative protection system. Finally, the comparison of protection systems which exploit structure to achieve protection is discussed. The notion of a structured domain is introduced and used in an assessment of the protection afforded by programmer-defined types and a supporting architecture. The Science Research Council: The Computing Laboratory, Newcastle University

Newcastle University eTheses

Cilk : efficient multithreaded computing

Author: Randall Keith H. (Keith Harold)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1998

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (p. 170-179). By Keith H. Randall. Ph.D.

DSpace@MIT

Portable high-performance programs

Author: Frigo Matteo, 1968-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1999

Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references (p. 159-169). By Matteo Frigo. Ph.D.

CiteSeerX

DSpace@MIT

The Cilk system for parallel multithreaded computing

Author: Joerg Christopher F. (Christopher Frank)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1996

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1996. Includes bibliographical references (p. 187-199). By Christopher F. Joerg. Ph.D.

DSpace@MIT

Memory abstractions for parallel programming

Author: Charles E. Leiserson; I-Ting Angelina Lee
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 156-163). A memory abstraction is an abstraction layer between the program execution and the memory that provides a different "view" of a memory location depending on the execution context in which the memory access is made. Properly designed memory abstractions help ease the task of parallel programming by mitigating the complexity of synchronization or admitting more efficient use of resources. This dissertation describes five memory abstractions for parallel programming: (i) cactus stacks that interoperate with linear stacks, (ii) efficient reducers, (iii) reducer arrays, (iv) ownership-aware transactions, and (v) location-based memory fences. To demonstrate the utility of memory abstractions, my collaborators and I developed Cilk-M, a dynamically multithreaded concurrency platform which embodies the first three memory abstractions. Many dynamic multithreaded concurrency platforms incorporate cactus stacks to support multiple stack views for all the active children simultaneously. The use of cactus stacks, albeit essential, forces concurrency platforms to trade off between performance, memory consumption, and interoperability with serial code due to their incompatibility with linear stacks. This dissertation proposes a new strategy to build a cactus stack using thread-local memory mapping (or TLMM), which enables Cilk-M to satisfy all three criteria simultaneously. A reducer hyperobject allows different branches of a dynamic multithreaded program to maintain coordinated local views of the same nonlocal variable. With reducers, one can use nonlocal variables in a parallel computation without restructuring the code or introducing races. This dissertation introduces memory-mapped reducers, which admit much more efficient access compared to existing implementations. When used in large quantities, reducers incur unnecessarily high overhead in execution time and space consumption. This dissertation describes support for reducer arrays, which offer the same functionality as an array of reducers with significantly less overhead. Transactional memory is a high-level synchronization mechanism, designed to be easier to use and more composable than fine-grain locking. This dissertation presents ownership-aware transactions, the first transactional memory design that provides provable safety guarantees for "open-nested" transactions. On architectures that implement memory models weaker than sequential consistency, programs communicating via shared memory must employ memory fences to ensure correct execution. This dissertation examines the concept of location-based memory fences, which, unlike traditional memory fences, incur latency only when synchronization is necessary. By I-Ting Angelina Lee. Ph.D.
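The idea of coordinated local views behind a reducer can be sketched in Python: each parallel branch updates its own private view, and the views are combined afterwards in branch order. This is an illustration of the concept only, not the memory-mapped reducers or Cilk-M runtime described above. The function `reduce_with_views` and its parameters are hypothetical, and the sketch assumes `combine` is associative with `identity()` as its identity element.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_with_views(chunks, identity, update, combine):
    """Give each chunk its own local view, then combine the views in chunk order."""

    def run(chunk):
        view = identity()  # fresh local view, private to this branch
        for x in chunk:
            view = update(view, x)
        return view

    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, whatever the finish order.
        views = list(pool.map(run, chunks))

    result = identity()
    for view in views:
        result = combine(result, view)
    return result
```

For example, with `identity=list`, `update=lambda v, x: v + [x]`, and `combine=lambda a, b: a + b`, the chunks `[[1, 2], [3], [4, 5]]` reduce to the same list a serial loop would build. No branch ever writes to a view owned by another branch, so the branches need no locks.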

CiteSeerX

DSpace@MIT