
    Lock-free Concurrent Data Structures

    Concurrent data structures are the data-sharing side of parallel programming. Data structures provide the means for a program to store data, along with operations to access and manipulate those data; these operations must be implemented by efficient algorithms. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel setting they matter even more, because utilizing parallelism means increased data and resource sharing. The first and main goal of this chapter is to provide sufficient background and intuition to help the interested reader navigate the complex research area of lock-free data structures. The second goal is to give the programmer enough familiarity with the subject to use truly concurrent methods.
    Comment: To appear in "Programming Multi-core and Many-core Computing Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and Distributed Computing
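    As a concrete taste of the subject, below is a minimal sketch of a Treiber-style lock-free stack in C11 atomics, one of the classic designs a survey like this covers. The structure and names are illustrative; a real design must also address the ABA problem and safe memory reclamation, which this sketch deliberately ignores.

        /* Treiber-style lock-free stack (illustrative sketch only). */
        #include <stdatomic.h>
        #include <stdlib.h>

        struct node { int value; struct node *next; };
        static _Atomic(struct node *) top = NULL;

        void push(int v) {
            struct node *n = malloc(sizeof *n);
            n->value = v;
            n->next = atomic_load(&top);
            /* Retry until no other thread changed top between the load and
             * the CAS; on failure the CAS refreshes n->next with the
             * currently observed top. */
            while (!atomic_compare_exchange_weak(&top, &n->next, n))
                ;
        }

        int pop(int *out) {                 /* returns 0 if the stack is empty */
            struct node *n = atomic_load(&top);
            while (n && !atomic_compare_exchange_weak(&top, &n, n->next))
                ;
            if (!n) return 0;
            *out = n->value;
            /* NOTE: freeing n here is unsafe without a reclamation scheme
             * (hazard pointers, epochs, ...), one of the central problems
             * the lock-free literature addresses. */
            return 1;
        }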

    Multicore XINU

    Multicore architectures employ multiple processing cores that work together for greater processing power, and shared-memory, symmetric multiprocessor (SMP) systems are now ubiquitous. All software, including the operating system, must be explicitly designed to support SMP processing. XINU is a simple, lightweight, elegant operating system that has existed for several decades and has been ported to many platforms; however, it has never been extended to support multicore processing. This project incrementally adds SMP support to the XINU operating system. Core kernel modules, including the scheduler and the memory manager, have been successfully extended without overhauling or completely redesigning XINU, and a multicore methodology is laid out for the remaining kernel modules.
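    As a flavor of what such a port involves, here is a sketch of the kind of spinlock an SMP kernel needs once disabling interrupts on a single core no longer guarantees mutual exclusion. It uses C11 atomics; the names are illustrative, not XINU's actual API.

        /* Test-and-set spinlock sketch for an SMP kernel (illustrative). */
        #include <stdatomic.h>

        typedef struct { atomic_flag locked; } spinlock_t;
        /* Initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */

        void spin_lock(spinlock_t *l) {
            /* Acquire semantics: critical-section accesses cannot be
             * reordered before the lock is taken. */
            while (atomic_flag_test_and_set_explicit(&l->locked,
                                                     memory_order_acquire))
                ;   /* spin; a CPU pause/yield hint belongs here */
        }

        void spin_unlock(spinlock_t *l) {
            /* Release semantics: critical-section writes become visible
             * to the next core that acquires the lock. */
            atomic_flag_clear_explicit(&l->locked, memory_order_release);
        }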

    Analysis and implementation of the multiprocessor bandwidth inheritance protocol

    The Multiprocessor Bandwidth Inheritance (M-BWI) protocol is an extension of the Bandwidth Inheritance (BWI) protocol to symmetric multiprocessor systems. Similar to Priority Inheritance, M-BWI lets a task that has locked a resource execute in the resource reservations of the blocked tasks, thus reducing their blocking time. The protocol is particularly suitable for open systems where different kinds of tasks dynamically arrive and leave, because it guarantees temporal isolation among independent subsets of tasks without requiring any information on their temporal parameters. Additionally, if the temporal parameters of the interacting tasks are known, it is possible to compute an upper bound on the interference suffered by a task due to other interacting tasks, and thus to provide timing guarantees for a subset of interacting hard real-time tasks. Finally, the M-BWI protocol is neutral to the underlying scheduling policy: it can be implemented under global, clustered and semi-partitioned scheduling. After introducing the M-BWI protocol, in this paper we formally prove its isolation properties and propose an algorithm to compute an upper bound on the interference suffered by a task. Then, we describe our implementation of the protocol for the LITMUS^RT real-time testbed and measure its overhead. Finally, we compare M-BWI against FMLP and OMLP, two other protocols for resource sharing in multiprocessor systems.
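    As a heavily simplified illustration of the inheritance step the protocol is built around (not the LITMUS^RT implementation; every structure and name below is hypothetical), the idea in C looks roughly like this:

        /* M-BWI core idea, simplified: when a task blocks on a busy
         * resource, its server (CPU reservation) is donated to the lock
         * holder, so the holder can run in it and release sooner. */
        #include <stddef.h>

        #define MAX_DONORS 8

        struct server { int budget_us; };        /* a CPU reservation */
        struct task   { struct server *srv; };   /* each task runs in its server */

        struct resource {
            struct task   *holder;
            struct server *donated[MAX_DONORS];  /* reservations inherited by holder */
            int            ndonated;
        };

        /* Returns 1 if t obtained the lock, 0 if the caller must block t. */
        int mbwi_lock(struct task *t, struct resource *r) {
            if (r->holder == NULL) {
                r->holder = t;                   /* uncontended case */
                return 1;
            }
            /* Contended: donate t's server to the holder. The scheduler may
             * now run r->holder whenever t's server is scheduled, so t's
             * blocking is bounded by the holder's critical-section length. */
            if (r->ndonated < MAX_DONORS)
                r->donated[r->ndonated++] = t->srv;
            return 0;
        }

        void mbwi_unlock(struct resource *r) {
            r->ndonated = 0;                     /* donated servers revert */
            r->holder = NULL;                    /* next waiter can acquire */
        }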

    Global synchronization algorithms for the Intel iPSC/860

    In a distributed-memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal system solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gsync() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.
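    The recursive doubling scheme can be sketched as follows; this uses MPI for readability rather than the iPSC/860 message primitives the paper actually measures, and it assumes a power-of-two number of nodes.

        /* Recursive-doubling barrier sketch (MPI used for illustration). */
        #include <mpi.h>

        void rds_barrier(MPI_Comm comm) {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            /* Round k: node i exchanges a token with node i XOR 2^k.
             * After log2(size) rounds every node has transitively heard
             * from all others, so all have reached the barrier. */
            for (int mask = 1; mask < size; mask <<= 1) {
                int partner = rank ^ mask;
                int out = 0, in = 0;
                MPI_Sendrecv(&out, 1, MPI_INT, partner, 0,
                             &in,  1, MPI_INT, partner, 0,
                             comm, MPI_STATUS_IGNORE);
            }
        }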

    An approach to task-based parallel programming for undergraduate students

    This paper presents the description of a compulsory parallel programming course in the bachelor degree in Informatics Engineering at the Barcelona School of Informatics, Universitat Politècnica de Catalunya (UPC-BarcelonaTech). The main focus of the course is the shared-memory programming paradigm, which facilitates the presentation of fundamental aspects and notions of parallel computing. Unlike the "traditional" loop-based approach on which parallel programming courses at other universities focus, this course presents parallel programming concepts using a task-based approach. Tasking allows students to explore a broader set of parallel decomposition strategies, including linear, iterative and recursive strategies, and their implementation using the current version of OpenMP (OpenMP 4.5), which offers mechanisms (pragmas and intrinsic functions) to easily map these strategies into parallel programs. Simple models for understanding the benefits of a task decomposition and the trade-offs introduced by different kinds of overheads are included in the course, together with tools that allow easy exploration of different task decomposition strategies and their potential parallelism (Tareador) and instrumentation and analysis of task-parallel executions on real machines (Extrae and Paraver).

    This work has been supported by the grant SEV-2015-0493 of the Severo Ochoa Program, awarded by the Spanish Government, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P) and by Generalitat de Catalunya (contracts 2014-MOOC-00057 and 2014-SGR-1051). We also thank the anonymous reviewers and editor for their comments during the review process, other professors that have been involved in the implementation of the course, and Paul Carpenter at BSC for his corrections and suggestions to improve the text.
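    An example of the kind of recursive task decomposition such a course teaches, using OpenMP tasking (an illustrative textbook example, not actual course material):

        /* Recursive Fibonacci with OpenMP tasks; compile with -fopenmp. */
        #include <stdio.h>

        long fib(int n) {
            if (n < 2) return n;
            long x, y;
            #pragma omp task shared(x)   /* child task computes fib(n-1) */
            x = fib(n - 1);
            #pragma omp task shared(y)   /* sibling task computes fib(n-2) */
            y = fib(n - 2);
            #pragma omp taskwait         /* join both children before combining */
            /* Real codes add a cutoff, e.g. final(n < 20), to bound the
             * task-creation overheads that simple cost models capture. */
            return x + y;
        }

        int main(void) {
            long r;
            #pragma omp parallel         /* create the thread team */
            #pragma omp single           /* one thread seeds the task tree */
            r = fib(30);
            printf("fib(30) = %ld\n", r);
            return 0;
        }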

    Evaluating Speedup in Parallel Compilers

    Parallel programming is prevalent in every field, mainly to speed up computation, and advancements in multiprocessor technology fuel this trend. However, modern compilers are still largely single-threaded and do not take advantage of the machine resources available to them. Much work has been done on compilers that add parallel constructs to the programs they compile, enabling those programs to exploit parallelism at run time; automatic parallelization of loops by a compiler is one such example. In contrast, researchers have done very little work toward parallelizing the compilation process itself. The research done here focuses on parallel compilers that target computation speedup by parallelizing the lexical analysis and semantic analysis phases of program compilation. Parallelization brings with it issues such as synchronization, concurrency and communication overhead; in the semantic analysis phase, these issues are of particular relevance during construction of the symbol table. Research on a concurrent compiler developed at the University of Toronto in 1991 proposed three techniques for generating the symbol table [Seshadri91]. The goal here is to implement a parallel compiler using concepts from those techniques as references. The research done here will augment the work done formerly and measure the performance speedup obtained.
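    To make one of these issues concrete, here is a sketch in C of concurrent insertion into a shared symbol table guarded by per-bucket locks. This is a generic baseline rather than any of the three techniques from [Seshadri91], and all names are illustrative.

        /* Hash-bucket symbol table with per-bucket locks (illustrative). */
        #include <pthread.h>
        #include <stdlib.h>
        #include <string.h>

        #define NBUCKETS 256

        struct symbol { char *name; struct symbol *next; };

        static struct symbol  *buckets[NBUCKETS];
        static pthread_mutex_t locks[NBUCKETS];

        void symtab_init(void) {
            for (int i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&locks[i], NULL);
        }

        static unsigned hash(const char *s) {
            unsigned h = 5381;
            while (*s) h = h * 33 + (unsigned char)*s++;
            return h % NBUCKETS;
        }

        /* Compiler threads working on different program units can insert
         * concurrently when they hit different buckets; only same-bucket
         * inserts serialize, which bounds the synchronization overhead. */
        void symtab_insert(const char *name) {
            unsigned b = hash(name);
            struct symbol *sym = malloc(sizeof *sym);
            sym->name = strdup(name);
            pthread_mutex_lock(&locks[b]);
            sym->next  = buckets[b];
            buckets[b] = sym;
            pthread_mutex_unlock(&locks[b]);
        }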