BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs
We present BifurKTM, the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes: GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput; and KoDTM, a Distributed Transactional Memory model that combines the Data- and Control- flow models to greatly reduce communication overheads.
Despite the allure of huge speedups, GPUs see limited use because they are hard to program and extremely sensitive to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, wherein a programmer must design algorithms to hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions "appear" to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory.
Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.
Program Partitioning and Synchronization on Multiprocessor Systems (Parallel, Computer Architecture, Compiler)
170 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 1986. Since the mid-1970s, vector machines have dominated the supercomputer market. Because of technological limitations, faster circuits and more levels of pipelining of vector processors can no longer satisfy the increasing demand for high-speed computation. Solving problems in parallel on multiprocessors is a natural trend. Five essential issues are identified in solving problems on multiprocessor systems: control structure, program partitioning, scheduling, synchronization, and memory access. The solutions to these problems determine the performance and efficiency of future multiprocessor machines. This thesis introduces new solutions for the synchronization and partitioning problems. The bit-map method synchronizes concurrently executing processes at the data level. Each synchronized data element has an attached sync field, and each synchronization memory operation contains a mask value. The data can be accessed only when the mask matches the sync, so that a proper referencing order for the data can be maintained. Two factors are considered when partitioning a program into processes executing in parallel: the amount of parallelism, and the memory access and synchronization overhead. This thesis introduces a minimum distance method which partitions a recurrence loop into independent execution sets. This method uses the minimum dependence distance of each dimension of all dependence cycles to divide the index set of the loop into independent partitions. When a loop does not have a sufficient number of independent sets, the block and the interleaved methods partition the loop using a proper synchronization mechanism. A programmer assistance tool helps programmers in using these partitioning and synchronization methods. A simulator in the tool compares different partitioning strategies.
This programmer assistance methodology allows users to explore several algorithms and select one which fits the appropriate multiprocessor architecture.
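The data-level synchronization described in this abstract (an access succeeds only when the operation's mask matches the element's sync field) might be pictured with the following minimal Python sketch. The class `SyncCell`, the method `try_access`, and the idea of passing the next sync value along with the operation are illustrative assumptions, not details taken from the thesis.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SyncCell:
    """A data element with an attached sync field (hypothetical layout)."""
    value: int = 0
    sync: int = 0

    def try_access(
        self,
        mask: int,
        new_value: Optional[int] = None,
        next_sync: Optional[int] = None,
    ) -> Tuple[bool, Optional[int]]:
        """Attempt a synchronized access.

        The access is allowed only when `mask` equals the current sync
        field. On success, optionally write `new_value` and advance the
        sync field to `next_sync` (an assumed way of enforcing order).
        Returns (succeeded, value seen after the access).
        """
        if mask != self.sync:
            return False, None  # mask mismatch: the caller has to retry later
        if new_value is not None:
            self.value = new_value
        if next_sync is not None:
            self.sync = next_sync
        return True, self.value


# A consumer that arrives before the producer is rejected; once the
# producer has written and advanced the sync field, the consumer succeeds.
cell = SyncCell()
early = cell.try_access(mask=1)
wrote = cell.try_access(mask=0, new_value=42, next_sync=1)
late = cell.try_access(mask=1, next_sync=0)
```

In this sketch the sync field acts as a tiny state machine: each operation names the state it expects, which is one way a referencing order between producer and consumer could be maintained.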
A High Performance Hybrid Architecture For Concurrent Query Execution
The most debated architectures for parallel database processing are the shared nothing (SN) and shared everything (SE) structures. Although SN is considered to be the most scalable, it is very sensitive to the data skew problem. SE, on the other hand, allows the collaborating processors to share the workload more efficiently, but it is limited by memory and disk I/O bandwidth. The authors present a hybrid architecture in which SE clusters are interconnected through a communication network to form an SN structure at the inter-cluster level. Processing elements are clustered into SE systems to minimize the skew effect. Each cluster, however, is kept within the limitation of the memory and I/O technology to avoid the data access bottleneck. A generalized performance model was developed to perform sensitivity analysis for the hybrid structure, and to compare it against SE and SN organizations.
A New Address-Free Memory Hierarchy Layer for Zero-Cycle Load
Data communications between producer instructions and consumer instructions through memory incur extra delays that degrade processor performance. In this paper, we introduce a new storage medium with a novel addressing mechanism to avoid address calculations. Instead of a memory address, each load and store is assigned a signature for accessing the new storage. A signature consists of the color of the base register along with its displacement value. To represent distinct register content, a unique color is assigned to a register whenever the register is updated. When two memory instructions have the same signature, they address the same memory location. This memory signature can be formed early in the processor pipeline. For fast data communication, a small Signature Buffer, addressed by the memory signature, can be established to permit stores and loads to bypass the normal memory hierarchy. Performance evaluations based on an Alpha 21264-like pipeline using SPEC2000 benchmarks show that an IPC (Instructions-Per-Cycle) improvement of 12-17% is possible using a small 8-entry signature buffer.
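As a rough illustration of the signature idea in this abstract (base-register color plus displacement, with a fresh color on every register update), the following Python sketch models a small buffer indexed by signature. The class names, the oldest-first eviction policy, and the use of `None` to signal a miss are assumptions made for the example; the abstract does not specify them.

```python
from collections import OrderedDict
from itertools import count


class RegisterColors:
    """Tracks the current color of each register (hypothetical helper)."""

    def __init__(self):
        self._next = count(1)
        self._color = {}

    def update(self, reg):
        # Called whenever `reg` is written: assign a fresh, unique color.
        self._color[reg] = next(self._next)

    def signature(self, base_reg, displacement):
        # A signature is the base register's color plus the displacement.
        return (self._color[base_reg], displacement)


class SignatureBuffer:
    """Small buffer addressed by signature instead of memory address."""

    def __init__(self, entries=8):
        self.entries = entries
        self._data = OrderedDict()

    def store(self, sig, value):
        if sig in self._data:
            self._data.pop(sig)
        elif len(self._data) >= self.entries:
            self._data.popitem(last=False)  # evict oldest entry (assumed policy)
        self._data[sig] = value

    def load(self, sig):
        # None means a miss: the load would go to the normal memory hierarchy.
        return self._data.get(sig)


colors = RegisterColors()
buf = SignatureBuffer()
colors.update("r1")
buf.store(colors.signature("r1", 8), 99)
hit = buf.load(colors.signature("r1", 8))   # same color, same displacement
colors.update("r1")                          # r1 is overwritten: new color
miss = buf.load(colors.signature("r1", 8))  # signature no longer matches
```

Because the signature needs only the register's color and the displacement, it can be formed without computing the effective address, which is the property the abstract relies on for early lookup.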
Signature buffer: Bridging performance gap between registers and caches
Data communications between producer instructions and consumer instructions through memory incur extra delays that degrade processor performance. In this paper, we introduce a new storage medium with a novel addressing mechanism to avoid address calculations. Instead of a memory address, each load and store is assigned a signature for accessing the new storage. A signature consists of the color of the base register along with its displacement value. A unique color is assigned to a register whenever the register is updated. When two memory instructions have the same signature, they address the same memory location. This memory signature can be formed early in the processor pipeline. A small Signature Buffer, addressed by the memory signature, can be established to permit stores and loads to bypass the normal memory hierarchy for fast data communication. Performance evaluations based on an Alpha 21264-like pipeline using SPEC2000 integer benchmarks show that an IPC (Instructions-Per-Cycle) improvement of 13-18% is possible using a small 8-entry signature buffer.