
    Hardware support for Local Memory Transactions on GPU Architectures

    Graphics Processing Units (GPUs) are popular hardware accelerators for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction, Multiple Thread (SIMT) fashion. However, the SIMT execution model is not efficient when code includes critical sections that protect access to data shared by the running threads. In addition, GPUs offer threads two shared memory spaces: local memory and global memory. Typical solutions for thread synchronization include using atomics to implement locks, serializing the execution of the critical section, or delegating the critical section to the host CPU, all of which lead to suboptimal performance. In the multi-core CPU world, transactional memory (TM) was proposed as an alternative to locks for coordinating concurrent threads, and some TM solutions for GPUs have started to appear in the literature. In contrast to these earlier proposals, our approach is to design hardware support for TM at two levels: the first level is a fast and lightweight mechanism for coordinating threads that share local memory, while the second level coordinates threads through global memory. In this paper we present GPU-LocalTM, a hardware TM (HTM) design for the first level. GPU-LocalTM offers simple conflict detection and version management mechanisms that minimize the hardware resources required for its implementation. For the workloads studied, GPU-LocalTM provides a 1.25X to 80X speedup over serialized critical sections, while the overhead introduced by transaction management is lower than 20%.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
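    The conflict detection and version management the abstract mentions can be illustrated with a minimal software sketch (all names hypothetical; GPU-LocalTM implements this in hardware): each transaction records the versions of locations it reads, buffers its writes, and commits only if no read location has changed, otherwise it retries instead of serializing the critical section.

```python
# Minimal software sketch of transactional conflict detection and
# version management (hypothetical API; GPU-LocalTM does this in hardware).

class Memory:
    def __init__(self):
        self.data = {}      # address -> committed value
        self.version = {}   # address -> version counter

    def read(self, addr):
        return self.data.get(addr, 0), self.version.get(addr, 0)

    def commit(self, read_set, write_set):
        # Conflict detection: abort if any location read has changed.
        for addr, ver in read_set.items():
            if self.version.get(addr, 0) != ver:
                return False
        # Version management: publish buffered writes, bump versions.
        for addr, val in write_set.items():
            self.data[addr] = val
            self.version[addr] = self.version.get(addr, 0) + 1
        return True

def transaction(mem, body):
    while True:                         # retry on abort, like an HTM retry loop
        read_set, write_set = {}, {}
        def tx_read(addr):
            if addr in write_set:       # read-your-own-writes
                return write_set[addr]
            val, ver = mem.read(addr)
            read_set.setdefault(addr, ver)
            return val
        def tx_write(addr, val):
            write_set[addr] = val       # buffered until commit
        body(tx_read, tx_write)
        if mem.commit(read_set, write_set):
            return

mem = Memory()
def add_one(r, w):
    w("x", r("x") + 1)
for _ in range(3):
    transaction(mem, add_one)
print(mem.data["x"])  # 3
```

    The point of the retry loop is that threads only pay for conflicts that actually occur, rather than always serializing as a lock would.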

    Multiversioning hardware transactional memory for fail-operational multithreaded applications

    Modern safety-critical embedded applications like autonomous driving need to be fail-operational, while high performance and low power consumption are demanded simultaneously. The prevalent fault tolerance mechanisms suffer from disadvantages: some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption; others, like lockstep, require supplementary checkpointing mechanisms to recover from errors; and further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach to fail-operational systems using hardware transactional memory for embedded systems. The hardware transactional memory is extended to support multiple versions, enabling redundant atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores. The evaluation shows that multiversioning can successfully recover from all transient errors, with an overhead comparable to that of fault tolerance mechanisms without recovery.
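    The recovery idea can be sketched in a few lines (a software illustration under assumed semantics, not the paper's hardware design): each update is executed redundantly, and a mismatch between the two speculative versions discards them and re-executes from the last committed version. The fault-injection pattern below is purely illustrative.

```python
calls = 0

def faulty_compute(x):
    # Deterministic fault injection for the sketch: every 7th
    # computation suffers a transient single-bit error.
    global calls
    calls += 1
    result = x + 1
    if calls % 7 == 0:
        result ^= 1                    # flip the low bit
    return result

def redundant_update(state):
    # Two redundant executions produce two speculative versions;
    # commit only when they agree, otherwise discard both and
    # retry from the committed version (recovery).
    while True:
        a = faulty_compute(state)
        b = faulty_compute(state)
        if a == b:
            return a

state = 0
for _ in range(100):
    state = redundant_update(state)
print(state)  # 100 despite the injected transient errors
```

    Because a transaction only commits when both versions agree, a single transient error costs one re-execution rather than a full checkpoint restore.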

    New hardware support for transactional memory and parallel debugging in multicore processors

    This thesis contributes to the area of hardware support for parallel programming by introducing new hardware elements in multicore processors, with the aim of improving the performance of tools, abstractions, and applications related to parallel programming, such as transactional memory and data race detectors. Specifically, we configure a hardware transactional memory system that uses signatures as part of its hardware support, and we develop a new hardware filter that reduces the signature size. We also develop the first hardware asymmetric data race detector (which is also able to tolerate the races it detects), likewise based on hardware signatures. Finally, we propose a new hardware signature module that addresses the lack of flexibility in hardware signatures that we encountered in the previous tools.
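    A hardware signature is essentially a Bloom filter summarizing a transaction's read or write set. The sketch below (illustrative parameters and hash; real signatures are fixed-size hardware bit vectors with dedicated hash logic) shows the property the thesis builds on: membership tests can produce false positives, which cause unnecessary aborts, but never false negatives.

```python
# Sketch of a hardware-signature (Bloom-filter) summary of a
# transaction's write set, used for conflict and race detection.
# SIG_BITS and the hash mix are illustrative, not from the thesis.

SIG_BITS = 64

def bit_positions(addr):
    # Knuth multiplicative hash, split into two bit positions.
    h = (addr * 2654435761) >> 16
    return (h % SIG_BITS, (h // SIG_BITS) % SIG_BITS)

def insert(sig, addr):
    for b in bit_positions(addr):
        sig |= 1 << b
    return sig

def may_contain(sig, addr):
    # False negatives are impossible; false positives cause
    # unnecessary aborts, which is why signature size matters.
    return all(sig & (1 << b) for b in bit_positions(addr))

write_sig = 0
for addr in (0x1000, 0x1008):          # addresses written by thread A
    write_sig = insert(write_sig, addr)

# Thread B's accesses are checked against A's write signature.
print(may_contain(write_sig, 0x1000))  # True  -> potential conflict
print(may_contain(write_sig, 0x2000))  # False -> definitely no conflict
```

    A smaller signature saves hardware but raises the false-positive rate, which is the trade-off the thesis's filter and flexible signature module target.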

    Low overhead scheduling of LoRa transmissions for improved scalability

    Recently, LoRaWAN has attracted much attention for realizing many Internet of Things applications because it offers low-power, long-range, and low-cost wireless communication. Recent works have shown that the LoRaWAN specification for class A devices comes with scalability limitations due to the ALOHA-like nature of the MAC layer. In this paper, we propose a synchronization and scheduling mechanism for LoRaWAN networks consisting of class A devices. The mechanism runs on top of the LoRaWAN MAC layer: a central network synchronization and scheduling entity schedules uplink and downlink transmissions. To reduce the synchronization packet length, all time slots assigned to an end node are encoded in a probabilistic, space-efficient data structure; an end node checks whether a time slot is a member of the received data structure to determine when to transmit. Time slots are assigned based on the traffic needs of the end nodes. We show that in a non-saturated multichannel LoRaWAN network, with synchronization done in a separate channel, the packet delivery ratio (PDR) is easily 7% (for SF7) to 30% (for SF12) higher than in an unsynchronized LoRaWAN network. For saturated networks, the differences in PDR become more pronounced, as nodes are only scheduled as long as they can be accommodated given the remaining capacity of the network. The synchronization process uses less than 3 mAh of extra battery capacity per end node over a one-year period, for synchronization periods longer than three days. This is less than the battery capacity spent transmitting packets that would be lost to collisions in an unsynchronized network.
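    The probabilistic slot encoding described above can be sketched as a Bloom filter (the paper does not name its exact structure; the filter size, hash mix, and node/slot values here are assumptions): the scheduler packs a node's assigned slots into a fixed-size bit vector, and the node transmits in a slot only if the filter reports membership.

```python
# Hypothetical sketch of encoding a node's assigned time slots in a
# space-efficient probabilistic structure (Bloom filter). M, K, and
# the hash mix are illustrative choices, not from the paper.

M = 256          # filter size in bits (32 bytes on air)
K = 3            # hash functions per slot

def slot_bit(node_id, slot, i):
    # Simple deterministic integer mix standing in for a real hash.
    h = (node_id * 0x9E3779B1 + slot * 0x85EBCA77 + i * 0xC2B2AE3D) & 0xFFFFFFFF
    return h % M

def encode_schedule(node_id, slots):
    filt = 0
    for s in slots:
        for i in range(K):
            filt |= 1 << slot_bit(node_id, s, i)
    return filt

def should_transmit(filt, node_id, slot):
    # No false negatives: every assigned slot is recognized. A rare
    # false positive only causes an extra, unscheduled transmission.
    return all(filt & (1 << slot_bit(node_id, slot, i)) for i in range(K))

NODE = 7
assigned = {4, 17, 42}
filt = encode_schedule(NODE, assigned)
print(all(should_transmit(filt, NODE, s) for s in assigned))  # True
```

    The appeal is that the downlink packet carries a fixed 32-byte filter regardless of how many slots are assigned, which is what keeps the synchronization packet short.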