398 research outputs found

    GNA: new framework for statistical data analysis

    Full text link
    We report on the status of GNA --- a new framework for fitting large-scale physical models. GNA utilizes the data flow concept within which a model is represented by a directed acyclic graph. Each node is an operation on an array (matrix multiplication, derivative or cross section calculation, etc). The framework enables the user to create flexible and efficient large-scale lazily evaluated models, handle large numbers of parameters, propagate parameters' uncertainties while taking into account possible correlations between them, fit models, and perform statistical analysis. The main goal of the paper is to give an overview of the main concepts and methods as well as reasons behind their design. Detailed technical information is to be published in further works.Comment: 9 pages, 3 figures, CHEP 2018, submitted to EPJ Web of Conference

    Efficient Precise Dynamic Data Race Detection For Cpu And Gpu

    Get PDF
    Data races are notorious bugs. They introduce non-determinism in programs behavior, complicate programs semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic in the last decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their application to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which has a disparate programming and execution model from CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs

    A Safety-First Approach to Memory Models.

    Full text link
    Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised. This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise. The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised. The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity. The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd

    Transactional memory on heterogeneous architectures

    Get PDF
    Tesis Leida el 9 de Marzo de 2018.Si observamos las necesidades computacionales de hoy, y tratamos de predecir las necesidades del mañana, podemos concluir que el procesamiento heterogéneo estará presente en muchos dispositivos y aplicaciones. El motivo es lógico: algoritmos diferentes y datos de naturaleza diferente encajan mejor en unos dispositivos de cómputo que en otros. Pongamos como ejemplo una tecnología de vanguardia como son los vehículos inteligentes. En este tipo de aplicaciones la computación heterogénea no es una opción, sino un requisito. En este tipo de vehículos se recolectan y analizan imágenes, tarea para la cual los procesadores gráficos (GPUs) son muy eficientes. Muchos de estos vehículos utilizan algoritmos sencillos, pero con grandes requerimientos de tiempo real, que deben implementarse directamente en hardware utilizando FPGAs. Y, por supuesto, los procesadores multinúcleo tienen un papel fundamental en estos sistemas, tanto organizando el trabajo de otros coprocesadores como ejecutando tareas en las que ningún otro procesador es más eficiente. No obstante, los procesadores tampoco siguen siendo dispositivos homogéneos. Los diferentes núcleos de un procesador pueden ofrecer diferentes características en términos de potencia y consumo energético que se adapten a las necesidades de cómputo de la aplicación. Programar este conjunto de dispositivos es una tarea compleja, especialmente en su sincronización. Habitualmente, esta sincronización se basa en operaciones atómicas, ejecución y terminación de kernels, barreras y señales. Con estas primitivas de sincronización básicas se pueden construir otras estructuras más complejas. Sin embargo, la programación de estos mecanismos es tediosa y propensa a fallos. La memoria transaccional (TM por sus siglas en inglés) se ha propuesto como un mecanismo avanzado a la vez que simple para garantizar la exclusión mutua

    Specialization without complexity in heterogeneous memory systems

    Get PDF
    The end of Dennard scaling and Moore's law has motivated a rise in the use of parallelism and hardware specialization in computer system design. Across all compute domains, applications have increasingly relied on specialized devices such as GPUs, DSPs, FPGAs, etc., to execute tasks faster and more efficiently, but interfacing these diverse devices within a heterogeneous system remains an important challenge. Early heterogeneous systems were loosely coupled and lacked a shared coherent memory interface, so specialization was reserved for highly regular code patterns with coarse-grained synchronization requirements. More recently, the need to accelerate applications with more irregular and fine-grained sharing patterns has led to significant research into closer integration of specialized devices. A single global address space enables improved programmability, communication efficiency, data reuse, and load balancing for emerging heterogeneous applications. Consequently, there have been many attempts to integrate specialized devices and their caches into a single coherent memory hierarchy to improve performance in future systems-on-chip (SoCs). However, coherence is particularly difficult to implement in heterogeneous systems. Differences in parallelism, locality, and synchronization in high-throughput accelerators such as GPUs means that coherence and consistency strategies designed for CPUs are ineffective, and evaluating the performance of alternative strategies is difficult. Recent efforts to implement coherence for such devices involve a simple software-driven coherence strategy combined with complex extensions to a conventional memory consistency model, which guarantees sequential consistency (SC) for programs that are data race-free (DRF). The first extension, scoped synchronization, avoids coherence costs when synchronization is guaranteed to be local, but it requires the use of the heterogeneous race-free (HRF) consistency model, which limits sharing patterns and increases the burden on the programmer. The second extension, relaxed atomics, allows the programmer to avoid costly ordering constraints when they are unnecessary for functionality, but existing consistency models offer complex and often poorly specified semantics when relaxed atomics are used. Once an appropriate coherence and consistency strategy is determined for a device, interfacing it with devices using different strategies poses another critical challenge. Existing integration strategies are incremental, either sacrificing system flexibility or incurring significant added complexity to achieve this goal. A rethinking of heterogeneous coherence and protocol integration from the ground up is needed. This work lays out a path to implementing flexible and efficient heterogeneous coherence without adding complexity to the consistency model or the system design. To help understand the memory demands of emerging specialized hardware, we first describe a performance analysis tool we developed for highly parallel workloads. Insights from this tool helped guide the development of a collection of coherence and consistency innovations for high-throughput accelerators. On the coherence side, we describe two innovations, DeNovo for GPUs and heterogeneous lazy release consistency (hLRC), which demonstrate that scoped synchronization is not necessary for cache efficiency in high-throughput devices. On the consistency side, this work describes the DRFrlx consistency model, which formalizes safe use cases of atomic relaxation. Again, we offer these benefits while retaining a simple SC-centric DRF consistency model. Finally, to address the challenge of integrating diverse coherence strategies, we present the Spandex coherence interface. Spandex can flexibly and simply integrate devices with a broad range of memory demands in an SoC, and we show how this flexibility enables new performance optimizations that can take advantage of hints about the expected memory demands of an application. Together, these innovations establish a framework for integrating future SoCs that can dynamically adapt to serve the diverse memory demands of future accelerators without incurring complexity for hardware or software designers

    RC3: Consistency directed cache coherence for x86-64 with RC extensions

    Get PDF
    corecore