398 research outputs found
GNA: new framework for statistical data analysis
We report on the status of GNA --- a new framework for fitting large-scale
physical models. GNA utilizes the data flow concept within which a model is
represented by a directed acyclic graph. Each node is an operation on an array
(matrix multiplication, derivative or cross section calculation, etc). The
framework enables the user to create flexible and efficient large-scale lazily
evaluated models, handle large numbers of parameters, propagate parameters'
uncertainties while taking into account possible correlations between them, fit
models, and perform statistical analysis. The main goal of the paper is to give
an overview of the main concepts and methods as well as reasons behind their
design. Detailed technical information is to be published in further works.Comment: 9 pages, 3 figures, CHEP 2018, submitted to EPJ Web of Conference
Efficient Precise Dynamic Data Race Detection For Cpu And Gpu
Data races are notorious bugs. They introduce non-determinism in programs behavior, complicate programs semantics, making it challenging to debug parallel programs. To make parallel programming easier, efficient data race detection has been a research topic in the last decades. However, existing data race detectors either sacrifice precision or incur high overhead, limiting their application to real-world applications and scenarios. This dissertation proposes approaches to improve the performance of dynamic data race detection without undermining precision, by identifying and removing metadata redundancy dynamically. This dissertation also explores ways to make it practical to detect data races dynamically for GPU programs, which has a disparate programming and execution model from CPU workloads. Further, this dissertation shows how the structured synchronization model in GPU programs can simplify the algorithm design of
data race detection for GPU, and how the unique patterns in GPU workloads enable an efficient implementation of the algorithm, yielding a high-performance dynamic data race detector for GPU programs
A Safety-First Approach to Memory Models.
Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics.
Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised.
This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise.
The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised.
The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity.
The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd
Transactional memory on heterogeneous architectures
Tesis Leida el 9 de Marzo de 2018.Si observamos las necesidades computacionales de hoy, y tratamos de predecir
las necesidades del mañana, podemos concluir que el procesamiento heterogéneo
estará presente en muchos dispositivos y aplicaciones.
El motivo es lógico: algoritmos diferentes y datos de naturaleza diferente encajan mejor
en unos dispositivos de cómputo que en otros. Pongamos como ejemplo una
tecnología de vanguardia como son los vehículos inteligentes. En este tipo de
aplicaciones la computación heterogénea no es una opción, sino un requisito.
En este tipo de vehículos se recolectan y analizan imágenes, tarea para la cual
los procesadores gráficos (GPUs) son muy eficientes.
Muchos de estos vehículos utilizan algoritmos sencillos,
pero con grandes requerimientos de tiempo real, que deben
implementarse directamente en hardware utilizando FPGAs.
Y, por supuesto, los procesadores multinúcleo tienen un
papel fundamental en estos sistemas, tanto organizando el trabajo de otros coprocesadores
como ejecutando tareas en las que ningún otro procesador
es más eficiente. No obstante, los procesadores tampoco siguen siendo dispositivos
homogéneos. Los diferentes núcleos de un procesador pueden
ofrecer diferentes características en términos de potencia y consumo
energético que se adapten a las necesidades de cómputo de la aplicación.
Programar este conjunto de dispositivos es una tarea compleja, especialmente
en su sincronización.
Habitualmente, esta sincronización se basa en operaciones atómicas, ejecución y
terminación de kernels, barreras y señales. Con estas primitivas de sincronización
básicas se pueden construir otras estructuras más complejas.
Sin embargo, la programación de estos
mecanismos es tediosa y propensa a fallos. La memoria transaccional
(TM por sus siglas en inglés) se ha propuesto como un mecanismo
avanzado a la vez que simple para garantizar la exclusión mutua
Specialization without complexity in heterogeneous memory systems
The end of Dennard scaling and Moore's law has motivated a rise in the use of parallelism and hardware specialization in computer system design. Across all compute domains, applications have increasingly relied on specialized devices such as GPUs, DSPs, FPGAs, etc., to execute tasks faster and more efficiently, but interfacing these diverse devices within a heterogeneous system remains an important challenge. Early heterogeneous systems were loosely coupled and lacked a shared coherent memory interface, so specialization was reserved for highly regular code patterns with coarse-grained synchronization requirements. More recently, the need to accelerate applications with more irregular and fine-grained sharing patterns has led to significant research into closer integration of specialized devices.
A single global address space enables improved programmability, communication efficiency, data reuse, and load balancing for emerging heterogeneous applications. Consequently, there have been many attempts to integrate specialized devices and their caches into a single coherent memory hierarchy to improve performance in future systems-on-chip (SoCs). However, coherence is particularly difficult to implement in heterogeneous systems. Differences in parallelism, locality, and synchronization in high-throughput accelerators such as GPUs means that coherence and consistency strategies designed for CPUs are ineffective, and evaluating the performance of alternative strategies is difficult. Recent efforts to implement coherence for such devices involve a simple software-driven coherence strategy combined with complex extensions to a conventional memory consistency model, which guarantees sequential consistency (SC) for programs that are data race-free (DRF). The first extension, scoped synchronization, avoids coherence costs when synchronization is guaranteed to be local, but it requires the use of the heterogeneous race-free (HRF) consistency model, which limits sharing patterns and increases the burden on the programmer. The second extension, relaxed atomics, allows the programmer to avoid costly ordering constraints when they are unnecessary for functionality, but existing consistency models offer complex and often poorly specified semantics when relaxed atomics are used. Once an appropriate coherence and consistency strategy is determined for a device, interfacing it with devices using different strategies poses another critical challenge. Existing integration strategies are incremental, either sacrificing system flexibility or incurring significant added complexity to achieve this goal. A rethinking of heterogeneous coherence and protocol integration from the ground up is needed.
This work lays out a path to implementing flexible and efficient heterogeneous coherence without adding complexity to the consistency model or the system design. To help understand the memory demands of emerging specialized hardware, we first describe a performance analysis tool we developed for highly parallel workloads. Insights from this tool helped guide the development of a collection of coherence and consistency innovations for high-throughput accelerators. On the coherence side, we describe two innovations, DeNovo for GPUs and heterogeneous lazy release consistency (hLRC), which demonstrate that scoped synchronization is not necessary for cache efficiency in high-throughput devices. On the consistency side, this work describes the DRFrlx consistency model, which formalizes safe use cases of atomic relaxation. Again, we offer these benefits while retaining a simple SC-centric DRF consistency model. Finally, to address the challenge of integrating diverse coherence strategies, we present the Spandex coherence interface. Spandex can flexibly and simply integrate devices with a broad range of memory demands in an SoC, and we show how this flexibility enables new performance optimizations that can take advantage of hints about the expected memory demands of an application. Together, these innovations establish a framework for integrating future SoCs that can dynamically adapt to serve the diverse memory demands of future accelerators without incurring complexity for hardware or software designers
- …