6 research outputs found

    A Concurrency-Agnostic Protocol for Multi-Paradigm Concurrent Debugging Tools

    Get PDF
    Today's complex software systems combine high-level concurrency models. Each model is used to solve a specific set of problems. Unfortunately, debuggers support only the low-level notions of threads and shared memory, forcing developers to reason about these notions instead of the high-level concurrency models they chose. This paper proposes a concurrency-agnostic debugger protocol that decouples the debugger from the concurrency models employed by the target application. As a result, the underlying language runtime can define custom breakpoints, stepping operations, and execution events for each concurrency model it supports, and a debugger can expose them without having to be specifically adapted. We evaluated the generality of the protocol by applying it to SOMns, a Newspeak implementation, which supports a diversity of concurrency models including communicating sequential processes, communicating event loops, threads and locks, fork/join parallelism, and software transactional memory. We implemented 21 breakpoints and 20 stepping operations for these concurrency models. For none of these, the debugger needed to be changed. Furthermore, we visualize all concurrent interactions independently of a specific concurrency model. To show that tooling for a specific concurrency model is possible, we visualize actor turns and message sends separately.Comment: International Symposium on Dynamic Language

    Debugging Programs that use Atomic Blocks and Transactional Memory

    No full text
    With the emergence of research prototypes, programming using atomic blocks and transactional memory (TM) is becoming more attractive. This paper describes our experience building and using a debugger for programs written with these abstractions. We introduce three approaches: (i) debugging at the level of atomic blocks, where the programmer is shielded from implementation details (such as exactly what kind of TM is used, or indeed whether lock inference is used instead), (ii) debugging at the level of transactions, where conflict rates, read sets, write sets, and other TM internals are visible, and (iii) debug-time transactions, which let the programmer manipulate synchronization from within the debugger— e.g., enlarging the scope of an atomic block to try to identify a bug. In this paper we explain the rationale behind the new debugging approaches that we propose. We describe the design and implementation of an extension to the WinDbg debugger, enabling support for C # programs using atomic blocks and TM. We also demonstrate the design of a “conflict point discovery ” technique for identifying program statements that introduce contention between transactions. We illustrate how these techniques can be used by optimizing a C# version of the Genome application from STAMP TM benchmark suite

    Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems

    Get PDF
    Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of parallel programming models that aims at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce a benchmark suite, RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM desings, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partition hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines. The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if'' serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, transparent to the underlying TM implementation, can be used for a broad set of C/C++ TM programs. Based on this new definition, we proposed T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past

    Atomic dataflow model

    Get PDF
    With the recent switch in the design of general purpose processors from frequency scaling of a single processor core towards increasing the number of processor cores, parallel programming became important not only for scientific programming but also for general purpose programming. This also stressed the importance of programmability of existing parallel programming models which were primarily designed for performance. It was soon recognized that new programming models are needed that will make parallel programming possible not only to experts, but to a general programming community. Transactional Memory (TM) is an example which follows this premise. It improves dramatically over any previous synchronization mechanism in terms of programmability and composability, at the price of possibly reduced performance. The main source of performance degradation in Transactional Memory is the overhead of transactional execution. Our work on parallelizing Quake game engine is a clear example of this problem. We show that Software Transactional Memory is superior in terms of programmability compared to lock based programming, but that performance is hindered due to extreme amount of overhead introduced by transactional execution. In the meantime, a significant research effort has been invested in overcoming this problem. Our approach is aimed towards improving the performance of transactional code by reducing transactional data conflicts. The idea is based on the organization of the code in which highly conflicting data is promoted to dataflow tokens that coordinate the execution of transactions. The main contribution of this thesis is Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ that integrates dataflow abstractions into the shared memory programming model. The ADF model provides language constructs that allow a programmer to delineate a program into a set of tasks and to explicitly define data dependencies for each task. The task dependency information is conveyed to the ADF runtime system that constructs a dataflow task graph that governs the execution of a program. Additionally, the ADF model allows tasks to share data. The key idea is that computation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory, which guarantees atomicity of shared memory updates. The second contribution of this thesis is DaSH - the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH includes sequential and shared-memory implementations based on OpenMP and TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations. We use DaSH not only to evaluate the ADF model, but to also compare it with other two hybrid dataflow models in order to identify the advantages and shortcomings of such models, and motivate further research on their characteristics. Finally, we study applicability of hybrid dataflow models for parallelization of the game engine. We show that hybrid dataflow models decrease the complexity of the parallel game engine implementation by eliminating or restructuring the explicit synchronization that is necessary in shared memory implementations. The corresponding implementations also exhibit good scalability and better speedup than the shared memory parallel implementations, especially in the case of a highly congested game world that contains a large number of game objects. Ultimately, on an eight core machine we were able to obtain 4.72x speedup compared to the sequential baseline, and to improve 49% over the lock-based parallel implementation based on work-sharing.Con el reciente cambio en el diseño de los procesadores de propósito general pasando del aumento de frecuencia al incremento del número de núcleos, la programación paralela se ha convertido en importante no solo para la comunidad científica sino también para la programación en general. Este hecho ha enfatizado la importancia de la programabilidad de los modelos actuales de programación paralela, cuyo objetivo era el rendimiento. Pronto se observó la necesidad de nuevos modelos de programación, para hacer factible la programación paralela a toda la comunidad. Transactional Memory (TM) es un ejemplo de dicho objetivo. Supone una gran mejora sobre cualquier método anterior de sincronización en términos de programabilidad, con una posible reducción del rendimiento como coste. La razón principal de dicha degradación es el sobrecoste de la ejecución transaccional. Nuestro trabajo en la paralelización del motor del juego Quake es un claro ejemplo de este problema. Demostramos que Software Transactional Memory es superior en términos de programabilidad a los modelos de programación basados en locks, pero que el rendimiento es entorpecido por el sobrecoste introducido por TM. Mientras tanto, se ha invertido un importante esfuerzo de investigación para superar dicho problema. Nuestra solución se dirige hacia la mejora del rendimiento del código transaccional reduciendo los conflictos con la información contenida en las transacciones. La idea se basa en la organización del código en el cual la información conflictiva es promocionada a señales del flujo de datos que coordinan la ejecución de las transacciones. La contribución principal de esta tesis es Atomic Dataflow Model (ADF), un nuevo modelo de programación para C/C++ basado en tareas que integra abstracciones de flujo de datos en el modelo de programación de la memoria compartida. El modelo ADF provee construcciones del lenguaje que permiten al programador la definición del programa como un conjunto de tareas, además de la definición explícita de las dependencias de datos para cada tarea. La información de dependencia de la tarea se transmite al runtime de ADF, que construye un grafo de tareas que es el que controla la ejecución de un programa. Adicionalmente, el modelo ADF permite que las tareas compartan información. La idea principal es que la computación es activada por el flujo de datos entre tareas, pero que dentro de una tarea la ejecución ocurre haciendo actualizaciones atómicas a un estado común mutable. Para conseguir este fin, el modelo ADF utiliza TM, que garantiza la atomicidad en las modificaciones de la memoria compartida. La segunda contribución es DaSH, el primer conjunto de benchmarks para los modelos de programación de flujo de datos híbridos y los de memoria compartida. DaSH contiene 11 benchmarks, cada uno representativo de uno de los Berkeley dwarfs que captura patrones de comunicaciones y procesamiento comunes en un amplio rango de aplicaciones emergentes. DaSH incluye implementaciones secuenciales y de memoria compartida basadas en OpenMP y TBB que facilitan la comparación entre los modelos híbridos de flujo de datos e implementaciones de memoria compartida. Nosotros usamos DaSH no solo para evaluar ADF, sino también para compararlo con otros dos modelos híbridos para identificar sus ventajas. Finalmente, estudiamos la aplicabilidad de dichos modelos híbridos para la paralelización del motor del juego. Mostramos que disminuyen la complejidad de la implementación paralela, eliminando o reestructurando la sincronización explícita que es necesaria en las implementaciones de memoria compartida. También se observa una buena escalabilidad y una aceleración mejor, especialmente en el caso de un ambiente de juego muy cargado. En última instancia, sobre una máquina con ocho núcleos se ha obtenido una aceleración del 4.72x comparado con el código secuencial, y una mejora del 49% sobre la implementación paralela basada en locks
    corecore