Search CORE

2,070 research outputs found

A Safety-First Approach to Memory Models.

Author: Singh Abhayendra Narayan
Publication venue
Publication date: 01/01/2016
Field of study

Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised. This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise. The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised. The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity. The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd

Deep Blue Documents at the University of Michigan

Coupled Kinetic-Fluid Simulations of Ganymede's Magnetosphere and Hybrid Parallelization of the Magnetohydrodynamics Model

Author: Zhou Hongyang
Publication venue
Publication date: 01/01/2020
Field of study

The largest moon in the solar system, Ganymede, is the only moon known to possess a strong intrinsic magnetic field. The interaction between the Jovian plasma and Ganymede's magnetic field creates a mini-magnetosphere with periodically varying upstream conditions, which creates a perfect laboratory in nature for studying magnetic reconnection and magnetospheric physics. Using the latest version of Space Weather Modeling Framework (SWMF), we study the upstream plasma interactions and dynamics in this subsonic, sub-Alfvénic system. We have developed a coupled fluid-kinetic Hall Magnetohydrodynamics with embedded Particle-in-Cell (MHD-EPIC) model for Ganymede's magnetosphere, with a self-consistently coupled resistive body representing the electrical properties of the moon's interior, improved inner boundary conditions, and high resolution charge and energy conserved PIC scheme. I reimplemented the boundary condition setup in SWMF for more versatile control and functionalities, and developed a new user module for Ganymede's simulation. Results from the models are validated with Galileo magnetometer data of all close encounters and compared with Plasma Subsystem (PLS) data. The energy fluxes associated with the upstream reconnection in the model is estimated to be about 10^-7 W/cm^2, which accounts for about 40% to the total peak auroral emissions observed by the Hubble Space Telescope. We find that under steady upstream conditions, magnetopause reconnection in our fluid-kinetic simulations occurs in a non-steady manner. Flux ropes with length of Ganymede's radius form on the magnetopause at a rate about 3/minute and create spatiotemporal variations in plasma and field properties. Upon reaching proper grid resolutions, the MHD-EPIC model can resolve both electron and ion kinetics at the magnetopause and show localized crescent shape distribution in both ion and electron phase space, non-gyrotropic and non-isotropic behavior inside the diffusion regions. The estimated global reconnection rate from the models is about 80 kV with 60% efficiency. There is weak evidence of

sim 1

minute periodicity in the temporal variations of the reconnection rate due to the dynamic reconnection process. The requirement of high fidelity results promotes the development of hybrid parallelized numerical model strategy and faster data processing techniques. The state-of-the-art finite volume/difference MHD code Block Adaptive Tree Solarwind Roe Upwind Scheme (BATS-R-US) was originally designed with pure MPI parallelization. The maximum problem size achievable was limited by the storage requirements of the block tree structure. To mitigate this limitation, we have added multithreaded OpenMP parallelization to the previous pure MPI implementation. We opt to use a coarse-grained approach by making the loops over grid blocks multithreaded and have succeeded in making BATS-R-US an efficient hybrid parallel code with modest changes in the source code while preserving the performance. Good weak scalings up to 50,0000 and 25,0000 of cores are achieved for the explicit and implicit time stepping schemes, respectively. This parallelization strategy greatly extends the possible simulation scale by an order of magnitude, and paves the way for future GPU-portable code development. To improve visualization and data processing, I have developed a whole new data processing workflow with the Julia programming language for efficient data analysis and visualization. As a summary, 1. I build a single fluid Hall MHD-EPIC model of Ganymede's magnetosphere; 2. I did detailed analysis of the upstream reconnection; 3. I developed a MPI+OpenMP parallel MHD model with BATSRUS; 4. I wrote a package for data analysis and visualization.PHDClimate and Space Sciences and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/163032/1/hyzhou_1.pd

Deep Blue Documents at the University of Michigan

New Techniques for On-line Testing and Fault Mitigation in GPUs

Author: RODRIGUEZ CONDIA JOSIE ESTEBAN
Publication venue: country:Italy
Publication date: 24/09/2021
Field of study

L'abstract è presente nell'allegato / the abstract is in the attachmen

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Parallelizing a network intrusion detection system using a GPU.

Author: Madhusoodhanan Sathik Anju Panicker, 1984-
Publication venue: ThinkIR: The University of Louisville\u27s Institutional Repository
Publication date: 01/05/2012
Field of study

As network speeds continue to increase and attacks get increasingly more complicated, there is need to improved detection algorithms and improved performance of Network Intrusion Detection Systems (NIDS). Recently, several attempts have been made to use the underutilized parallel processing capabilities of GPUs, to offload the costly NIDS pattern matching algorithms. This thesis presents an interface for NIDS Snort that allows porting of the pattern-matching algorithm to run on a GPU. The analysis show that this system can achieve up to four times speedup over the existing Snort implementation and that GPUs can be effectively utilized to perform intensive computational processes like pattern matching

University of Louisville

Towards lightweight and high-performance hardware transactional memory

Author: Tomić Sasa
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2012
Field of study

Conventional lock-based synchronization serializes accesses to critical sections guarded by the same lock. Using multiple locks brings the possibility of a deadlock or a livelock in the program, making parallel programming a difficult task. Transactional Memory (TM) is a promising paradigm for parallel programming, offering an alternative to lock-based synchronization. TM eliminates the risk of deadlocks and livelocks, while it provides the desirable semantics of Atomicity, Consistency, and Isolation of critical sections. TM speculatively executes a series of memory accesses as a single, atomic, transaction. The speculative changes of a transaction are kept private until the transaction commits. If a transaction can break the atomicity or cause a deadlock or livelock, the TM system aborts the transaction and rolls back the speculative changes. To be effective, a TM implementation should provide high performance and scalability. While implementations of TM in pure software (STM) do not provide desirable performance, Hardware TM (HTM) implementations introduce much smaller overhead and have relatively good scalability, due to their better control of hardware resources. However, many HTM systems support only the transactions that fit limited hardware resources (for example, private caches), and fall back to software mechanisms if hardware limits are reached. These HTM systems, called best-effort HTMs, are not desirable since they force a programmer to think in terms of hardware limits, to use both HTM and STM, and to manage concurrent transactions in HTM and STM. In contrast with best-effort HTMs, unbounded HTM systems support overflowed transactions, that do not fit into private caches. Unbounded HTM systems often require complex protocols or expensive hardware mechanisms for conflict detection between overflowed transactions. In addition, an execution with overflowed transactions is often much slower than an execution that has only regular transactions. This is typically due to restrictive or approximative conflict management mechanism used for overflowed transactions. In this thesis, we study hardware implementations of transactional memory, and make three main contributions. First, we improve the general performance of HTM systems by proposing a scalable protocol for conflict management. The protocol has precise conflict detection, in contrast with often-employed inexact Bloom-filter-based conflict detection, which often falsely report conflicts between transactions. Second, we propose a best-effort HTM that utilizes the new scalable conflict detection protocol, termed EazyHTM. EazyHTM allows parallel commits for all non-conflicting transactions, and generally simplifies transaction commits. Finally, we propose an unbounded HTM that extends and improves the initial protocol for conflict management, and we name it EcoTM. EcoTM features precise conflict detection, and it efficiently supports large as well as small and short transactions. The key idea of EcoTM is to leverage an observation that very few locations are actually conflicting, even if applications have high contention. In EcoTM, each core locally detects if a cache line is non-conflicting, and conflict detection mechanism is invoked only for the few potentially conflicting cache lines.La Sincronización tradicional basada en los cerrojos de exclusión mutua (locks) serializa los accesos a las secciones críticas protegidas este cerrojo. La utilización de varios cerrojos en forma concurrente y/o paralela aumenta la posibilidad de entrar en abrazo mortal (deadlock) o en un bloqueo activo (livelock) en el programa, está es una de las razones por lo cual programar en forma paralela resulta ser mucho mas dificultoso que programar en forma secuencial. La memoria transaccional (TM) es un paradigma prometedor para la programación paralela, que ofrece una alternativa a los cerrojos. La memoria transaccional tiene muchas ventajas desde el punto de vista tanto práctico como teórico. TM elimina el riesgo de bloqueo mutuo y de bloqueo activo, mientras que proporciona una semántica de atomicidad, coherencia, aislamiento con características similares a las secciones críticas. TM ejecuta especulativamente una serie de accesos a la memoria como una transacción atómica. Los cambios especulativos de la transacción se mantienen privados hasta que se confirma la transacción. Si una transacción entra en conflicto con otra transacción o sea que alguna de ellas escribe en una dirección que la otra leyó o escribió, o se entra en un abrazo mortal o en un bloqueo activo, el sistema de TM aborta la transacción y revierte los cambios especulativos. Para ser eficaz, una implementación de TM debe proporcionar un alto rendimiento y escalabilidad. Las implementaciones de TM en el software (STM) no proporcionan este desempeño deseable, en cambio, las mplementaciones de TM en hardware (HTM) tienen mejor desempeño y una escalabilidad relativamente buena, debido a su mejor control de los recursos de hardware y que la resolución de los conflictos así el mantenimiento y gestión de los datos se hace en hardware. Sin embargo, muchos de los sistemas de HTM están limitados a los recursos de hardware disponibles, por ejemplo el tamaño de las caches privadas, y dependen de mecanismos de software para cuando esos límites son sobrepasados. Estos sistemas HTM, llamados best-effort HTM no son deseables, ya que obligan al programador a pensar en términos de los límites existentes en el hardware que se esta utilizando, así como en el sistema de STM que se llama cuando los recursos son sobrepasados. Además, tiene que resolver que transacciones hardware y software se ejecuten concurrentemente. En cambio, los sistemas de HTM ilimitados soportan un numero de operaciones ilimitadas o sea no están restringidos a límites impuestos artificialmente por el hardware, como ser el tamaño de las caches o buffers internos. Los sistemas HTM ilimitados por lo general requieren protocolos complejos o mecanismos muy costosos para la detección de conflictos y el mantenimiento de versiones de los datos entre las transacciones. Por otra parte, la ejecución de transacciones es a menudo mucho más lenta que en una ejecución sobre un sistema de HTM que este limitado. Esto es debido al que los mecanismos utilizados en el HTM limitado trabaja con conjuntos de datos relativamente pequeños que caben o están muy cerca del núcleo del procesador. En esta tesis estudiamos implementaciones de TM en hardware. Presentaremos tres contribuciones principales: Primero, mejoramos el rendimiento general de los sistemas, al proponer un protocolo escalable para la gestión de conflictos. El protocolo detecta los conflictos de forma precisa, en contraste con otras técnicas basadas en filtros Bloom, que pueden reportar conflictos falsos entre las transacciones. Segundo, proponemos un best-effort HTM que utiliza el nuevo protocolo escalable detección de conflictos, denominado EazyHTM. EazyHTM permite la ejecución completamente paralela de todas las transacciones sin conflictos, y por lo general simplifica la ejecución. Por último, proponemos una extensión y mejora del protocolo inicial para la gestión de conflictos, que llamaremos EcoTM. EcoTM cuenta con detección de conflictos precisa, eficiente y es compatible tanto con transacciones grandes como con pequeñas. La idea clave de EcoTM es aprovechar la observación que en muy pocas ubicaciones de memoria aparecen los conflictos entre las transacciones, incluso en aplicaciones tienen muchos conflictos. En EcoTM, cada núcleo detecta localmente si la línea es conflictiva, además existe un mecanismo de detección de conflictos detallado que solo se activa para las pocas líneas de memoria que son potencialmente conflictivas

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Secretaría de Estado de Cultura

New hardware support transactional memory and parallel debugging in multicore processors

Author: Orosa Nogueira Lois
Publication venue
Publication date: 01/01/2013
Field of study

This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance and optimize new tools, abstractions and applications related with parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system with signatures as part of the hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate them), based also in hardware signatures. Finally, we propose a new module of hardware signatures that solves some of the problems that we found in the previous tools related with the lack of flexibility in hardware signatures

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional da Universidade de Santiago de Compostela

Transactional memory on heterogeneous architectures

Author: Villegas Fernandez Alejandro
Publication venue: UMA Editorial
Publication date: 01/01/2018
Field of study

Tesis Leida el 9 de Marzo de 2018.Si observamos las necesidades computacionales de hoy, y tratamos de predecir las necesidades del mañana, podemos concluir que el procesamiento heterogéneo estará presente en muchos dispositivos y aplicaciones. El motivo es lógico: algoritmos diferentes y datos de naturaleza diferente encajan mejor en unos dispositivos de cómputo que en otros. Pongamos como ejemplo una tecnología de vanguardia como son los vehículos inteligentes. En este tipo de aplicaciones la computación heterogénea no es una opción, sino un requisito. En este tipo de vehículos se recolectan y analizan imágenes, tarea para la cual los procesadores gráficos (GPUs) son muy eficientes. Muchos de estos vehículos utilizan algoritmos sencillos, pero con grandes requerimientos de tiempo real, que deben implementarse directamente en hardware utilizando FPGAs. Y, por supuesto, los procesadores multinúcleo tienen un papel fundamental en estos sistemas, tanto organizando el trabajo de otros coprocesadores como ejecutando tareas en las que ningún otro procesador es más eficiente. No obstante, los procesadores tampoco siguen siendo dispositivos homogéneos. Los diferentes núcleos de un procesador pueden ofrecer diferentes características en términos de potencia y consumo energético que se adapten a las necesidades de cómputo de la aplicación. Programar este conjunto de dispositivos es una tarea compleja, especialmente en su sincronización. Habitualmente, esta sincronización se basa en operaciones atómicas, ejecución y terminación de kernels, barreras y señales. Con estas primitivas de sincronización básicas se pueden construir otras estructuras más complejas. Sin embargo, la programación de estos mecanismos es tediosa y propensa a fallos. La memoria transaccional (TM por sus siglas en inglés) se ha propuesto como un mecanismo avanzado a la vez que simple para garantizar la exclusión mutua

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Málaga

Simulating the nonlinear QED vacuum

Author: Lindner Andreas Maximilian
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 07/07/2023
Field of study

Digitale Hochschulschriften der LMU

GPU Parallel Implementation of Dual-Depth Sparse Probabilistic Latent Semantic Analysis for Hyperspectral Unmixing

Author: Fernandez-Beltran Ruben
Gallardo Jaramago Jose Antonio
Haut Juan M.
Paoletti Mercedes Eugenia
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

Hyperspectral unmixing (HU) is an important task for remotely sensed hyperspectral (HS) data exploitation. It comprises the identification of pure spectral signatures (endmembers) and their corresponding fractional abundances in each pixel of the HS data cube. Several methods have been developed for (semi-) supervised and automatic identification of endmembers and abundances. Recently, the statistical dual-depth sparse probabilistic latent semantic analysis (DEpLSA) method has been developed to tackle the HU problem as a latent topic-based approach in which both endmembers and abundances can be simultaneously estimated according to the semantics encapsulated by the latent topic space. However, statistical models usually lead to computationally demanding algorithms and the computational time of the DEpLSA is often too high for practical use, in particular, when the dimensionality of the HS data cube is large. In order to mitigate this limitation, this article resorts to graphical processing units (GPUs) to provide a new parallel version of the DEpLSA, developed using the NVidia compute device unified architecture. Our experimental results, conducted using four well-known HS datasets and two different GPU architectures (GTX 1080 and Tesla P100), show that our parallel versions of the DEpLSA and the traditional pLSA approach can provide accurate HU results fast enough for practical use, accelerating the corresponding serial versions in at least 30x in the GTX 1080 and up to 147x in the Tesla P100 GPU, which are quite significant acceleration factors that increase with the image size, thus allowing for the possibility of the fast processing of massive HS data repositories

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositori Institucional de la Universitat Jaume I