Efficient Transactional-Memory-based Implementation of Morph Algorithms on GPU
General Purpose GPUs (GPGPUs) are ideal platforms for parallel execution of applications with regular shared memory access patterns. However, the majority of real-world multithreaded applications require access to shared memory with irregular patterns. Morph algorithms, which arise in many real-world applications, change their graph data structures in unpredictable ways, thus leading to irregular access patterns to shared data. Such irregularity makes morph algorithms challenging to implement on GPUs, which favor regularity. Borouvka’s algorithm for calculating a Minimum Spanning Forest (MSF) and multilevel graph partitioning are two examples of morph algorithms with varied levels of expressed parallelism. In this work we show that a transactional-memory-based design and implementation of morph algorithms on GPUs can handle some of the challenges arising from irregularity, such as code complexity and synchronization overhead. First, we identify the major phases of the algorithm that require synchronization of the shared data. If the algorithm exhibits certain algebraic properties (e.g., monotonicity, idempotency, associativity), we can use lock-free synchronization for performance; otherwise we utilize a Software Transactional Memory (STM) based synchronization method. Experimental results show that our GPU-based implementation of Borouvka’s algorithm outperforms both the fastest sequential implementation and the existing STM-based implementation on multicore CPUs when tested on large-scale graphs with diverse densities. Moreover, to show the applicability of our approach to other morph algorithms, we do a pen-and-paper implementation and complexity analysis of multilevel graph partitioning.
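To make the morph-algorithm example concrete, here is a minimal sequential sketch of Borouvka's algorithm over an edge list. This is an illustrative baseline only, not the paper's GPU or STM implementation; the function name and edge-list representation are assumptions. Note how each round contracts components, mutating the structure unpredictably, which is exactly the irregularity the abstract describes.

```python
# Sequential sketch of Boruvka's Minimum Spanning Forest algorithm.
# Edges are (u, v, weight) tuples; vertices are 0..num_vertices-1.
# Illustrative only -- not the paper's GPU/STM-based implementation.

def boruvka_msf(num_vertices, edges):
    parent = list(range(num_vertices))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    msf = []
    changed = True
    while changed:
        changed = False
        # cheapest[c] = lightest edge leaving component c this round
        cheapest = {}
        for u, v, w in edges:
            cu, cv = find(u), find(v)
            if cu == cv:
                continue  # edge is internal to a component
            for c in (cu, cv):
                if c not in cheapest or w < cheapest[c][2]:
                    cheapest[c] = (u, v, w)
        # Contract: merge each component with its cheapest neighbor.
        for u, v, w in cheapest.values():
            cu, cv = find(u), find(v)
            if cu != cv:
                parent[cu] = cv
                msf.append((u, v, w))
                changed = True
    return msf
```

On a GPU, the per-edge scan parallelizes naturally, while the contraction step is where the shared union-find structure needs lock-free or transactional synchronization.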
Towards a Software Transactional Memory for Graphics Processors
The introduction of general purpose computing on many-core graphics processor
systems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization.
Software transactional memory (STM) simplifies development of concurrent code by allowing the
programmer to mark sections of code to be executed concurrently and atomically in an optimistic manner.
In contrast to locks,
STMs are easy to compose and do not suffer from deadlocks.
We have designed and implemented two STMs for graphics processors, one blocking and one non-blocking.
The design issues involved in these two STMs are described and
explained in the paper, together with experimental results comparing the performance of the two STMs.
Dynamic Orchestration of Massively Data Parallel Execution.
Graphics processing units (GPUs) are specialized hardware accelerators
capable of rendering graphics much faster than conventional
general-purpose processors. They are widely used in personal computers,
tablets, mobile phones, and game consoles. Modern GPUs are not only
efficient at manipulating computer graphics, but are also more effective
than CPUs for algorithms where processing of large data blocks can be done
in parallel. This is mainly due to their highly parallel architecture.
While GPUs provide low-cost and efficient
platforms for accelerating massively parallel applications, tedious
performance tuning is required to maximize application execution
efficiency. Achieving high performance requires programmers to
manually manage the amount of on-chip memory used per thread, the total
number of threads per multiprocessor, the pattern of off-chip memory
accesses, etc.
In addition to a complex programming model, there is a lack of performance
portability across various systems with different runtime properties. Programmers usually make assumptions about
runtime properties when they write code and optimize that code based
on those assumptions. However, if any of these properties changes
during execution, the optimized code performs poorly. To alleviate these
limitations, several implementations of the application are needed to
maximize performance under different runtime properties. However, it
is not practical for the programmer to write several different versions of the
same code, each optimized for an individual runtime condition.
In this thesis, we propose a static and dynamic compiler framework to
take the burden of fine tuning different implementations of the same code
off the programmer. This framework enables the programmer to write the
program once and allow a static compiler to generate different versions of
a data parallel application with several tuning parameters. The runtime
system selects the best version and fine tunes its parameters based on
runtime properties such as device configuration, input size, dependency,
and data values.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
http://deepblue.lib.umich.edu/bitstream/2027.42/108805/1/mehrzads_1.pd
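The runtime version-selection idea described in this abstract can be sketched in miniature. All names, variants, and the size threshold below are illustrative assumptions: a "compiler" has produced two variants of the same SAXPY kernel, and a tiny dispatcher picks one based on a runtime property (input size), standing in for the thesis's richer selection over device configuration, dependency, and data values.

```python
# Illustrative sketch of runtime selection among pre-generated variants
# of one data-parallel kernel. Variant names and the threshold are
# assumptions, not the thesis's actual framework.

def saxpy_scalar(a, xs, ys):
    # Variant tuned for small inputs: a plain loop with no setup cost.
    return [a * x + y for x, y in zip(xs, ys)]

def saxpy_chunked(a, xs, ys, chunk=256):
    # Variant tuned for large inputs: fixed-size chunks, standing in
    # for a tiled GPU kernel with a tunable block size.
    out = []
    for i in range(0, len(xs), chunk):
        out.extend(a * x + y
                   for x, y in zip(xs[i:i + chunk], ys[i:i + chunk]))
    return out

def saxpy(a, xs, ys, threshold=1024):
    # Runtime dispatch on a measured property (here, input size);
    # the real framework also fine-tunes each variant's parameters.
    if len(xs) < threshold:
        return saxpy_scalar(a, xs, ys)
    return saxpy_chunked(a, xs, ys)
```

The point of the dispatcher is that the programmer writes the computation once; which variant runs, and with what parameters, is a runtime decision rather than a hard-coded assumption.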