129 research outputs found

    Fast Dynamic Arrays

    Get PDF
    We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of nn elements supporting access in time O(1)O(1) and insertion and deletion in time O(nϵ)O(n^\epsilon) for ϵ>0\epsilon > 0 while using o(n)o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to 10810^8 elements. Our fastest implementation uses much less space than multiset while providing speedups of 40×40\times for access operations compared to multiset and speedups of 10.000×10.000\times compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations

    Achieving High Performance and High Productivity in Next Generational Parallel Programming Languages

    Get PDF
    Processor design has turned toward parallelism and heterogeneity cores to achieve performance and energy efficiency. Developers find high-level languages attractive because they use abstraction to offer productivity and portability over hardware complexities. To achieve performance, some modern implementations of high-level languages use work-stealing scheduling for load balancing of dynamically created tasks. Work-stealing is a promising approach for effectively exploiting software parallelism on parallel hardware. A programmer who uses work-stealing explicitly identifies potential parallelism and the runtime then schedules work, keeping otherwise idle hardware busy while relieving overloaded hardware of its burden. However, work-stealing comes with substantial overheads. These overheads arise as a necessary side effect of the implementation and hamper parallel performance. In addition to runtime-imposed overheads, there is a substantial cognitive load associated with ensuring that parallel code is data-race free. This dissertation explores the overheads associated with achieving high performance parallelism in modern high-level languages. My thesis is that, by exploiting existing underlying mechanisms of managed runtimes; and by extending existing language design, high-level languages will be able to deliver productivity and parallel performance at the levels necessary for widespread uptake. The key contributions of my thesis are: 1) a detailed analysis of the key sources of overhead associated with a work-stealing runtime, namely sequential and dynamic overheads; 2) novel techniques to reduce these overheads that use rich features of managed runtimes such as the yieldpoint mechanism, on-stack replacement, dynamic code-patching, exception handling support, and return barriers; 3) comprehensive analysis of the resulting benefits, which demonstrate that work-stealing overheads can be significantly reduced, leading to substantial performance improvements; and 4) a small set of language extensions that achieve both high performance and high productivity with minimal programmer effort. A managed runtime forms the backbone of any modern implementation of a high-level language. Managed runtimes enjoy the benefits of a long history of research and their implementations are highly optimized. My thesis demonstrates that converging these highly optimized features together with the expressiveness of high-level languages, gives further hope for achieving high performance and high productivity on modern parallel hardwar

    Scheduling and synchronization for multicore concurrency platforms

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 217-230).Developing correct and efficient parallel programs is difficult since programmers often have to manage low-level details like scheduling and synchronization explicitly. Recently, however, many hardware vendors have been shifting towards building multicore computers. This trend creates an enormous pressure to create concurrency platforms - platforms that provide an easier interface for parallel programming and enable ordinary programmers to write scalable, portable and efficient parallel programs. This thesis provides some provably-good practical solutions to problems that arise in the implementation of concurrency platforms, particularly in the domain of scheduling and synchronization. The first part of this thesis describes work on scheduling of parallel programs written in dynamic multithreaded languages (such as Cilk, Hood etc.). These languages allow the programmer to express parallelism of their code in a natural manner, while an automatic scheduler in the concurrency platform is responsible for scheduling the program on the underlying parallel hardware. This thesis presents designs to increase the functionality of these concurrency platforms. The second part of the thesis presents work on transactional memory semantics and design. Transactional memory (TM), has been recently proposed as an alternative to locks. TM provides a transactional interface to memory. The programmers can specify their critical sections inside a transaction, and the TM concurrency platform guarantees that the region executes atomically. One of the purported advantages of TM over locks is that transactional code is composable.(cont.) Most of the current TM concurrency platforms do not support full composability, however. This thesis addresses two of the composability problems in existing TM concurrency platforms.by Kunal Agrawal.Ph.D

    RE-LANG---A Parallel-by-default Programming Language

    Get PDF
    In recent years, programming language features such as lightweight threads have gained popularity in the software development workflow. Our research takes a critical look at these recent trends, rethinking them through an academic lens. We propose a construct called "smart assignment," supported by rewriting semantics, which enables a novel parallel-by-default programming paradigm. We present a new programming language—RE-LANG—that implements this feature. Specifically, we demonstrate how the design philosophy of RE-LANG makes imperative, parallel programming more developer-friendly. We discuss the implementation of the language and showcase performance benchmarks, as well as overhead analysis, to demonstrate its efficiency.Doctor of Philosoph

    Fast Dynamic Arrays

    Get PDF
    We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(n^e) for e > 0 while using o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and set from the standard library on sequences with up to 10^8 elements. Our fastest implementation uses much less space than set while providing speedups of 40x for access operations compared to set and speedups of 10.000x compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations

    A Scalable Locality-aware Adaptive Work-stealing Scheduler for Multi-core Task Parallelism

    Get PDF
    Recent trend has made it clear that the processor makers are committed to the multicore chip designs. The number of cores per chip is increasing, while there is little or no increase in the clock speed. This parallelism trend poses a significant and urgent challenge on computer software because programs have to be written or transformed into a multi-threaded form to take full advantage of future hardware advances. Task parallelism has been identified as one of the prerequisites for software productivity. In task parallelism, programmers focus on decomposing the problem into subcomputations that can run in parallel and leave the compiler and runtime to handle the scheduling details. This separation of concerns between task decomposition and scheduling provides productivity to the programmer but poses challenges to the runtime scheduler. Our thesis is that work-stealing schedulers with adaptive scheduling policies and locality-awareness can provide a scalable and robust runtime foundation for multicore task parallelism. We evaluate our thesis using the new Scalable Locality-aware Adaptive Work-stealing (SLAW) runtime scheduler developed for the Habanero-Java programming language, a task-parallel variant of Java. SLAW's adaptive task scheduling is motivated by the study of two common scheduling policies in a work-stealing scheduler, specifically, the work-first and the help-first policy. Both policies exhibit limitations in performance and resource usage in different situations. The variances make it hard to determine the best policy a priori. SLAW addresses these limitations by supporting both policies simultaneously and selecting policies adaptively on a per-task basis at runtime. Our results show that SLAW achieves O.98x to 9.2x speedup over the help-first scheduler and O.97x to 4.5x speedup over the work-first scheduler. Further, for large irregular parallel computations, SLAW supports data sizes and achieves performance that cannot be delivered by the use of any single fixed policy. SLAW's locality-aware scheduling framework aims to overcome the cache unfriendliness of work-stealing due to randomized stealing. The SLAW scheduler is designed for programming models where locality hints are provided to the runtime by the programmer or compiler. Our results show that locality-aware scheduling can improve performance by increasing temporal data reuse for iterative data-parallel applications

    Programming Models and Runtimes for Heterogeneous Systems

    Get PDF
    With the plateauing of processor frequencies and increase in energy consumption in computing, application developers are seeking new sources of performance acceleration. Heterogeneous platforms with multiple processor architectures offer one possible avenue to address these challenges. However, modern heterogeneous programming models tend to be either so low-level as to severely hinder programmer productivity, or so high-level as to limit optimization opportunities. The novel systems presented in this thesis strike a better balance between abstraction and transparency, enabling programmers to be productive and produce high-performance applications on heterogeneous platforms. This thesis starts by summarizing the strengths, weaknesses, and features of existing heterogeneous programming models. It then introduces and evaluates four novel heterogeneous programming models and runtime systems: JCUDA, CnC-CUDA, DyGR, and HadoopCL. We'll conclude by positioning the key contributions of each piece in this thesis relative to the state-of-the-art, and outline possible directions for future work

    High-level synthesis of fine-grained weakly consistent C concurrency

    Get PDF
    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.  As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    Get PDF
    This manuscript includes recent scientific work regarding the Intel Single Chip Cloud computer and describes approaches for novel approaches for programming and run-time organization
    • …
    corecore