Hardware support for Local Memory Transactions on GPU Architectures
Graphics Processing Units (GPUs) are popular hardware accelerators for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction - Multiple Thread (SIMT) fashion. However, the SIMT execution model is not efficient when code includes critical sections that protect access to data shared by the running threads. In addition, GPUs offer threads two shared memory spaces: local memory and global memory. Typical solutions to thread synchronization include using atomics to implement locks, serializing the execution of the critical section, or delegating the execution of the critical section to the host CPU, all leading to suboptimal performance.
In the multi-core CPU world, transactional memory (TM) was proposed as an alternative to locks for coordinating concurrent threads, and some TM solutions for GPUs have started to appear in the literature. In contrast to these earlier proposals, our approach is to design hardware support for TM at two levels. The first level is a fast and lightweight solution for coordinating threads that share the local memory, while the second level coordinates threads through the global memory. In this paper we present GPU-LocalTM as hardware TM (HTM) support for the first level. GPU-LocalTM offers simple conflict detection and version management mechanisms that minimize the hardware resources required for its implementation. For the workloads studied, GPU-LocalTM provides a 1.25x to 80x speedup over serialized critical sections, while the overhead introduced by transaction management is lower than 20%.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
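The conflict detection and version management that the abstract describes can be illustrated with a small software analogue. This is a sketch only: GPU-LocalTM implements these mechanisms in hardware, and the class and method names below are invented for illustration, not taken from the paper.

```python
# Software analogue of word-granularity conflict detection with lazy
# version management: writes are buffered, not applied in place, and
# two transactions conflict if one wrote a location the other accessed.

class Transaction:
    def __init__(self, memory):
        self.memory = memory          # shared memory (a plain list here)
        self.read_set = set()         # addresses this transaction read
        self.write_buffer = {}        # addr -> speculative value

    def read(self, addr):
        if addr in self.write_buffer:     # read-your-own-write
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.memory[addr]

    def write(self, addr, value):
        self.write_buffer[addr] = value   # buffered until commit

    def conflicts_with(self, other):
        # Conflict: the other transaction wrote something we accessed.
        accessed = self.read_set | self.write_buffer.keys()
        return bool(other.write_buffer.keys() & accessed)

    def commit(self):
        # Version management is lazy: memory is only updated on commit.
        for addr, value in self.write_buffer.items():
            self.memory[addr] = value
```

Two transactions that touch the same address would be flagged by `conflicts_with`, and the losing one would simply discard its write buffer and retry; in hardware this bookkeeping is done per memory bank rather than with Python sets.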
Concentration Inequalities in the Wild: Case Studies in Blockchain & Reinforcement Learning
Concentration inequalities (CIs) are a powerful tool that provides probability bounds on how far a random variable deviates from its expectation. In this dissertation, first I describe a blockchain protocol that I have developed, called Graphene, which uses CIs to provide probabilistic guarantees on performance. Second, I analyze the extent to which CIs are robust when the assumptions they require are violated, using Reinforcement Learning (RL) as the domain.
Graphene is a method for interactive set reconciliation among peers in blockchains and related distributed systems. Through the novel combination of a Bloom filter and an Invertible Bloom Lookup Table (IBLT), Graphene uses a fraction of the network bandwidth used by deployed work for one- and two-way synchronization. It is a fast and implementation-independent algorithm that uses CIs to parameterize an IBLT so that it is optimal in size for a given desired decode rate. I characterize performance improvements through analysis, detailed simulation, and deployment results for Bitcoin Cash, a prominent cryptocurrency. Implementations of Graphene, IBLTs, and the IBLT optimization algorithm are all open-source code.
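The sizing step that Graphene automates can be sketched with the standard Bloom filter formulas plus a fixed IBLT cell overhead. This is an assumption-laden sketch: the dissertation derives its parameters from concentration bounds for a target decode rate, while the constants and function names below are generic textbook values, not the paper's derivation.

```python
import math

def bloom_params(n, fpr):
    """Standard Bloom filter sizing: number of bits m and hash count k
    for n items at a target false-positive rate fpr."""
    m = math.ceil(-n * math.log(fpr) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

def iblt_cells(d, overhead=1.5):
    """An IBLT listing d symmetric-difference items typically uses a
    constant-factor cell overhead (1.5x is a common rule of thumb);
    Graphene instead tunes this factor via CIs for a desired decode
    rate, which is not reproduced here."""
    return math.ceil(overhead * max(d, 1))
```

A relay could call `bloom_params` on the sender's mempool size and `iblt_cells` on the expected difference, trading a slightly larger filter against a smaller IBLT; choosing that trade-off optimally is exactly what Graphene's CI-based analysis contributes.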
Second, I analyze the extent to which existing methods rely on accurate training data for a specific class of RL algorithms, known as Safe and Seldonian RL. Several Seldonian RL algorithms have a component called the safety test, which uses CIs to lower-bound the performance of a new policy using training data collected from another policy. I introduce a new measure of security to quantify the susceptibility to corruptions in training data, and show that a couple of Seldonian RL methods are extremely sensitive to even a few data corruptions, completely breaking the probability bounds guaranteed by CIs. I then introduce a new algorithm, called Panacea, that is more robust against data corruptions, and demonstrate its usage in practice on some RL problems, including a grid-world and a diabetes treatment simulation.
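The safety-test idea can be illustrated with the simplest CI, Hoeffding's inequality. This is a generic sketch under assumed names: the dissertation's algorithms use particular estimators (e.g. importance-weighted returns) and bounds that are not reproduced here.

```python
import math

def hoeffding_lower_bound(returns, delta, rmin=0.0, rmax=1.0):
    """With probability at least 1 - delta, the true mean return of the
    policy exceeds this value (Hoeffding's inequality for i.i.d. samples
    bounded in [rmin, rmax])."""
    n = len(returns)
    mean = sum(returns) / n
    return mean - (rmax - rmin) * math.sqrt(math.log(1 / delta) / (2 * n))

def safety_test(candidate_returns, baseline_performance, delta=0.05):
    """Accept the candidate policy only if its performance provably
    (w.p. 1 - delta) matches or beats the baseline."""
    return hoeffding_lower_bound(candidate_returns, delta) >= baseline_performance
```

The fragility the dissertation studies is visible in `hoeffding_lower_bound`: an adversary who replaces a few entries of `returns` with `rmax` shifts the sample mean, and with it the "guaranteed" lower bound, by an amount the CI's i.i.d. assumption never accounts for.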
Multiversioning hardware transactional memory for fail-operational multithreaded applications
Modern safety-critical embedded applications like autonomous driving need to be fail-operational, while high performance and low power consumption are demanded simultaneously. The prevalent fault tolerance mechanisms suffer from disadvantages: some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption; others, like lockstep, require supplementary checkpointing mechanisms to recover from errors; further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism caused by multithreaded execution. This paper presents a novel approach to fail-operational embedded systems based on hardware transactional memory. The hardware transactional memory is extended to support multiple versions, enabling redundant atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores. The evaluation shows that multiversioning can successfully recover from all transient errors with an overhead comparable to that of fault tolerance mechanisms without recovery.
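The multiversioning idea, redundant execution with commit only on agreement, can be sketched in software. This is an analogue only: the paper's mechanism lives inside the HTM hardware, and the function names and retry policy below are illustrative assumptions.

```python
import copy

def redundant_commit(memory, txn_body, max_retries=3):
    """Execute the transaction body twice on separate versions of memory
    and commit only when both versions agree; a mismatch indicates a
    transient error and triggers re-execution (the recovery path that
    plain lockstep would need a checkpoint for)."""
    for _ in range(max_retries):
        v1 = txn_body(copy.deepcopy(memory))
        v2 = txn_body(copy.deepcopy(memory))
        if v1 == v2:              # redundant versions agree -> commit
            memory.update(v1)
            return True
    return False                  # persistent fault: fail safe, don't commit
```

Because each attempt works on its own version and nothing is committed until the comparison passes, a transient fault in one redundant execution is contained and recovered by simply rerunning the transaction.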
New hardware support for transactional memory and parallel debugging in multicore processors
This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving performance and optimizing new tools, abstractions, and applications related to parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system with signatures as part of the hardware support, and we develop a new hardware filter for reducing the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate races), likewise based on hardware signatures. Finally, we propose a new hardware signature module that solves some of the problems found in the previous tools related to the lack of flexibility in hardware signatures.
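A hardware signature of the kind the thesis builds on is essentially a Bloom filter over memory addresses, and its behavior can be modeled in a few lines. This is a software model under assumed names and parameters; the thesis's actual filter designs and sizes are not reproduced here.

```python
import hashlib

class Signature:
    """Model of a hardware signature: a fixed-size bit vector that
    summarizes a set of addresses via k hash functions.  Tests may
    report false positives but never false negatives, which is the
    property conflict and race detection rely on."""
    def __init__(self, bits=256, k=4):
        self.bits, self.k = bits, k
        self.vector = 0

    def _positions(self, addr):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{addr}".encode(), digest_size=4)
            yield int.from_bytes(h.digest(), "little") % self.bits

    def insert(self, addr):
        for p in self._positions(addr):
            self.vector |= 1 << p

    def may_contain(self, addr):
        return all(self.vector >> p & 1 for p in self._positions(addr))

    def intersects(self, other):
        # Hardware checks two signatures for overlap with a bitwise AND:
        # a nonzero result means a *potential* conflict (never a miss).
        return bool(self.vector & other.vector)
```

A transactional core would keep one signature for its read set and one for its write set; detecting a conflict with another core is then a single AND of bit vectors, and reducing the size of those vectors without inflating the false-positive rate is exactly the kind of problem the thesis's filter addresses.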
Low overhead scheduling of LoRa transmissions for improved scalability
Recently, LoRaWAN has attracted much attention for the realization of many Internet of Things applications because it offers low-power, long-distance, and low-cost wireless communication. Recent works have shown that the LoRaWAN specification for class A devices comes with scalability limitations due to the ALOHA-like nature of the MAC layer. In this paper, we propose a synchronization and scheduling mechanism for LoRaWAN networks consisting of class A devices. The mechanism runs on top of the LoRaWAN MAC layer. A central network synchronization and scheduling entity schedules uplink and downlink transmissions. In order to reduce the synchronization packet length, all time slots assigned to an end node are encoded in a probabilistic, space-efficient data structure. An end node checks whether a time slot is part of the received data structure in order to determine when to transmit. Time slots are assigned based on the traffic needs of the end nodes. We show that in a nonsaturated multichannel LoRaWAN network with synchronization done in a separate channel, the packet delivery ratio (PDR) is easily 7% (for SF7) to 30% (for SF12) higher than in an unsynchronized LoRaWAN network. For saturated networks, the differences in PDR become more pronounced, as nodes are only scheduled as long as they can be accommodated given the remaining capacity of the network. The synchronization process uses less than 3 mAh of extra battery capacity per end node over a one-year period, for synchronization periods longer than three days. This is less than the battery capacity used to transmit packets that would be lost in an unsynchronized network due to collisions.
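The probabilistic slot encoding can be illustrated with a Bloom-filter sketch: the scheduler inserts (device, slot) pairs and broadcasts only the compact filter, and each end node tests its own pairs locally. The class name, parameters, and hash choice below are illustrative assumptions, not the paper's exact structure.

```python
import hashlib

class SlotFilter:
    """Space-efficient encoding of slot assignments.  False positives
    make a node transmit in an unassigned slot (a possible collision);
    false negatives cannot occur, so assigned slots are never missed."""
    def __init__(self, bits=1024, k=3):
        self.bits, self.k = bits, k
        self.vector = 0

    def _positions(self, dev, slot):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{dev}:{slot}".encode(), digest_size=4)
            yield int.from_bytes(h.digest(), "little") % self.bits

    def assign(self, dev, slot):
        # Scheduler side: mark this (device, slot) pair in the filter.
        for p in self._positions(dev, slot):
            self.vector |= 1 << p

    def is_assigned(self, dev, slot):
        # End-node side: membership test on the broadcast filter.
        return all(self.vector >> p & 1 for p in self._positions(dev, slot))
```

The broadcast cost is the fixed filter size (here 1024 bits) regardless of how many slots are assigned, which is what keeps the synchronization packet, and hence the quoted sub-3 mAh yearly energy budget, small.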