19 research outputs found

    Dynamic Determinacy Race Detection for Task-Parallel Programs with Promises

    Get PDF
    Much of the past work on dynamic data-race and determinacy-race detection algorithms for task parallelism has focused on structured parallelism with fork-join constructs and, more recently, with future constructs. This paper addresses the problem of dynamic detection of data-races and determinacy-races in task-parallel programs with promises, which are more general than fork-join constructs and futures. The motivation for our work is twofold. First, promises have now become a mainstream synchronization construct, with their inclusion in multiple languages, including C++, JavaScript, and Java. Second, past work on dynamic data-race and determinacy-race detection for task-parallel programs does not apply to programs with promises, thereby identifying a vital need for this work. This paper makes multiple contributions. First, we introduce a featherweight programming language that captures the semantics of task-parallel programs with promises and provides a basis for formally defining determinacy using our semantics. This definition subsumes functional determinacy (same output for same input) and structural determinacy (same computation graph for same input). The main theoretical result shows that the absence of data races is sufficient to guarantee determinacy with both properties. We are unaware of any prior work that established this result for task-parallel programs with promises. Next, we introduce a new Dynamic Race Detector for Promises that we call DRDP. DRDP is the first known race detection algorithm that executes a task-parallel program sequentially without requiring the serial-projection property; this is a critical requirement since programs with promises do not satisfy the serial-projection property in general. Finally, the paper includes experimental results obtained from an implementation of DRDP. The results show that, with some important optimizations introduced in our work, the space and time overheads of DRDP are comparable to those of more restrictive race detection algorithms from past work. To the best of our knowledge, DRDP is the first determinacy race detector for task-parallel programs with promises

    A toolchain to verify the parallelization of OmpSs-2 applications

    Get PDF
    Programming models for task-based parallelization based on compile-time directives are very effective at uncovering the parallelism available in HPC applications. Despite that, the process of correctly annotating complex applications is error-prone and may hinder the general adoption of these models. In this paper, we target the OmpSs-2 programming model and present a novel toolchain able to detect parallelization errors coming from non-compliant OmpSs-2 applications. Our toolchain verifies the compliance with the OmpSs-2 programming model using local task analysis to deal with each task separately, and structural induction to extend the analysis to the whole program. To improve the effectiveness of our tools, we also introduce some ad-hoc verification annotations, which can be used manually or automatically to disable the analysis of specific code regions. Experiments run on a sample of representative kernels and applications show that our toolchain can be successfully used to verify the parallelization of complex real-world applications.This project is supported by the European Union’s Horizon 2021 research and innovation programme under grant agreement No 754304 (DEEP-EST), by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871669 (AMPERE) and the Project HPCEUROPA3 (INFRAIA-2016-1-730897), by the Ministry of Economy of Spain through the Severo Ochoa Center of Excellence Program (SEV-2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by the Generalitat de Catalunya (2017-SGR-1481).Peer ReviewedPostprint (author's final draft

    Transactions Everywhere

    Get PDF
    Arguably, one of the biggest deterrants for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is the enforcing of atomicity so that the partial results of one task do not inadvertently corrupt another task. Atomicity is typically enforced through locking protocols, but these protocols can introduce other complications, such as deadlock, unless restrictive methodologies in their use are adopted. We have recently begun a research project focusing on transactional memory [18] as an alternative mechanism for enforcing atomicity, since it allows the user to avoid many of the complications inherent in locking protocols. Rather than viewing transactions as infrequent occurrences in a program, as has generally been done in the past, we have adopted the point of view that all user code should execute in the context of some transaction. To make this viewpoint viable requires the development of two key technologies: effective hardware support for scalable transactional memory, and linguistic and compiler support. This paper describes our preliminary research results on making “transactions everywhere” a practical reality.Singapore-MIT Alliance (SMA

    Efficient Race Detection with Futures

    Full text link
    This paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works determinacy race detection have mostly focused on either task parallel programs that follow a series-parallel dependence structure or ones with unrestricted use of futures that generate arbitrary dependences. In this work, we consider a restricted use of futures and show that it can be race detected more efficiently than general use of futures. Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags targets programs that use futures in a restricted fashion and runs in time O(T1α(m,n))O(T_1 \alpha(m,n)), where T1T_1 is the sequential running time of the program, α\alpha is the inverse Ackermann's function, mm is the total number of memory accesses, nn is the dynamic count of places at which parallelism is created. Since α\alpha is a very slowly growing function (upper bounded by 44 for all practical purposes), it can be treated as a close-to-constant overhead. MultiBags+ an extension of MultiBags that target programs with general use of futures. It runs in time O((T1+k2)α(m,n))O((T_1+k^2)\alpha(m,n)) where T1T_1, α\alpha, mm and nn are defined as before, and kk is the number of future operations in the computation. We implemented both algorithms and empirically demonstrate their efficiency

    Easier Parallel Programming with Provably-Efficient Runtime Schedulers

    Get PDF
    Over the past decade processor manufacturers have pivoted from increasing uniprocessor performance to multicore architectures. However, utilizing this computational power has proved challenging for software developers. Many concurrency platforms and languages have emerged to address parallel programming challenges, yet writing correct and performant parallel code retains a reputation of being one of the hardest tasks a programmer can undertake. This dissertation will study how runtime scheduling systems can be used to make parallel programming easier. We address the difficulty in writing parallel data structures, automatically finding shared memory bugs, and reproducing non-deterministic synchronization bugs. Each of the systems presented depends on a novel runtime system which provides strong theoretical performance guarantees and performs well in practice

    Efficient Data Race Detection for Async-Finish Parallelism

    Full text link
    Abstract. A major productivity hurdle for parallel programming is the presence of data races. Data races can lead to all kinds of harmful program behaviors, includ-ing determinism violations and corrupted memory. However, runtime overheads of current dynamic data race detectors are still prohibitively large (often incurring slowdowns of 10 × or larger) for use in mainstream software development. In this paper, we present an efficient dynamic race detector algorithm targeting the async-finish task-parallel parallel programming model. The async and finish constructs are at the core of languages such as X10 and Habanero Java (HJ). These constructs generalize the spawn-sync constructs used in Cilk, while still ensuring that all computation graphs are deadlock-free. We have implemented our algorithm in a tool called TASKCHECKER and eval-uated it on a suite of 12 benchmarks. To reduce overhead of the dynamic analysis, we have also implemented various static optimizations in the tool. Our experi-mental results indicate that our approach performs well in practice, incurring an average slowdown of 3.05 × compared to a serial execution in the optimized case.

    Provably and Practically Efficient Race Detection for Task-Parallel Code

    Get PDF
    Parallel systems are pervasive nowadays. Specifically, modern computers have embraced multicore architectures due to the difficulties of exploiting higher clock speeds on single-core CPUs. However, parallel programming is challenging. Determinacy race, in particular, is a common pitfall when writing task-parallel code. It can easily lead to non-deterministic behavior of the parallel program and therefore a determinacy race is often considered as a bug. Unfortunately, such bugs are hard to debug because they do not necessarily produce obvious failures in every single execution. To ease the debugging process of determinacy races in task-parallel code, this dissertation proposes several provably and practically efficient parallel race detection algorithms. Unlike prior works mostly target fork-join parallelism, we focus on less structured but important programming paradigms – pipeline parallelism and futures. In addition, we build an efficient runtime system for scheduling futures, which is not only a facility to study the race detection problem for futures but also useful in practice. Finally, this dissertation investigates mechanisms that optimize the access history of race detectors, which provides significant additional boost to the performance

    Running stream-like programs on heterogeneous multi-core systems

    Get PDF
    All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes---specifically software pipelining and buffer allocation---and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task based program using StarSs.Totes les empreses de semiconductors produeixen actualment multi-cores. Mòbils,PCs, portàtils, i dispositius mòbils d’Internet necessitaran programari quefaci servir eficientment aquests cores. Escriure programari paral·lel d’altrendiment és difícil, laboriós i propens a errors, incrementant tant el tempsde llançament al mercat com el cost. El programari té una vida més llarga queel maquinari; típicament pren més temps desenvolupar nou programi que noumaquinari, i el programari ja existent pot perdurar molt temps, durant el qualel nombre de cores dels sistemes incrementarà. La productivitat dedesenvolupament i manteniment millorarà si el paral·lelisme i els detallstècnics són gestionats per la màquina, mentre el programador raona sobre elconjunt de l’aplicació.El programari paral·lel hauria de ser escrit en llenguatges específics deldomini. Aquests llenguatges extrauen paral·lelisme implícit, el qual és ocultatper un llenguatge seqüencial com C. Quan l’assignació de memòria i lesestructures de control són gestionades pel compilador, l’estructura iorganització de dades del programi poden ser modificades de manera segura ifiable per les transformacions d’alt nivell del compilador.Un dels dominis de l’aplicació importants és el que consta dels programes destream; aquest programes són estructurats com a nuclis independents queinteractuen només a través de canals d’un sol sentit, anomenats streams. Laprogramació de streams no és aplicable a tots els programes, però sorgeix deforma natural en la codificació i descodificació d’àudio i vídeo, gràfics 3D, iprocessament de senyals digitals. Aquesta representació permet transformacionsd’alt nivell, fins i tot descomposició i fusió de nucli.Aquesta tesi desenvolupa noves tècniques de compilació i sistemes en tempsd’execució per a programació de streams. La primera part d’aquesta tesi esfocalitza amb un compilador de streams de planificació estàtica. Presenta unnou algorisme de partició estàtica, que determina quins nuclis han de serfusionats, per tal d’equilibrar la càrrega en els processadors i en lesinterconnexions. Un bon algorisme de particionat és fonamental per tal de queel compilador produeixi codi eficient. L’algorisme també té en compte elspassos de compilació subseqüents---específicament software pipelining il’arranjament de buffers---i modela la capacitat del compilador per fusionarnuclis. Aquesta tesi també presenta un algorisme estàtic de redimensionament de cues.Aquest algorisme és important quan la memòria és distribuïda, especialment quanles memòries locals són petites. L’algorisme té en compte latències ivariacions en els temps de càlcul, i considera el límit imposat per la mida deles memòries locals.La segona part d’aquesta tesi es centralitza en la planificació dinàmica deprogrames de streams. En primer lloc, investiga el rendiment dels planificadorsdinàmics online, non-preemptive i non-clairvoyant. En segon lloc, proposa dosplanificadors dinàmics per programes de stream. El primer és específicament pera programes de streams unidimensionals. El segon és més general: no necessitael graf de streams, però els overheads són una mica més grans.Aquesta tesi també presenta un conjunt d’eines de suport relacionades amb laprogramació de streams. StarssCheck és una eina de depuració, que és basa enValgrind, per StarSs, un llenguatge de programació paral·lela basat en tasques.Aquesta eina genera un avís cada vegada que el comportament del programa estàen contradicció amb una anotació pragma. Aquest comportament d’una altra manerapodria causar excepcions o situacions de competició. StreamIt to OmpSs és unaeina per convertir un programa de streams codificat en el llenguatge StreamIt aun programa de tasques en StarSs planificat de forma dinàmica.Postprint (published version
    corecore