An Efficient Multiway Mergesort for GPU Architectures
Sorting is a primitive operation that is a building block for countless
algorithms. As such, it is important to design sorting algorithms that approach
peak performance on a range of hardware architectures. Graphics Processing
Units (GPUs) are particularly attractive architectures as they provide massive
parallelism and computing power. However, the intricacies of their compute and
memory hierarchies make designing GPU-efficient algorithms challenging. In this
work we present GPU Multiway Mergesort (MMS), a new GPU-efficient multiway
mergesort algorithm. MMS employs a new partitioning technique that exposes the
parallelism needed by modern GPU architectures. To the best of our knowledge,
MMS is the first sorting algorithm for the GPU that is asymptotically optimal
in terms of global memory accesses and that is completely free of shared memory
bank conflicts.
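As a point of reference for the term "multiway mergesort", the following is a minimal sequential sketch in Python. It is not the MMS algorithm and does not model its GPU partitioning technique; it only illustrates the general idea of splitting the input into k runs, sorting each run, and merging all k runs in a single pass. The function name and the parameters `k` and `base` are assumptions made for illustration.

```python
import heapq


def multiway_mergesort(keys, k=4, base=8):
    """Sort `keys` by recursive k-way splitting followed by a k-way merge.

    Illustrative sequential sketch only; `k` (fan-out) and `base`
    (cutoff for the base case) are hypothetical parameters.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    n = len(keys)
    # Small inputs are sorted directly; the cutoff is at least 1 so
    # the recursion always terminates.
    if n <= max(base, 1):
        return sorted(keys)
    step = -(-n // k)  # ceil(n / k): length of each run
    runs = [multiway_mergesort(keys[i:i + step], k, base)
            for i in range(0, n, step)]
    # heapq.merge performs a lazy k-way merge of already sorted runs.
    return list(heapq.merge(*runs))
```

A larger k means fewer merge passes over the data, which is the usual reason multiway merging is attractive when each pass has to go through slow memory.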
We realize an initial implementation of MMS, evaluate its performance on
three modern GPU architectures, and compare it to competitive implementations
available in state-of-the-art GPU libraries. Despite these implementations
being highly optimized, MMS compares favorably, achieving performance
improvements for most random inputs. Furthermore, unlike MMS, state-of-the-art
algorithms are susceptible to bank conflicts. We find that for certain inputs
that cause these algorithms to incur large numbers of bank conflicts, MMS can
achieve up to a 37.6% speedup over its fastest competitor. Overall, even though
its current implementation is not fully optimized, due to its efficient use of
the memory hierarchy, MMS outperforms the fastest comparison-based sorting
implementations available to date.
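The notion of a shared memory bank conflict mentioned above can be illustrated with a small Python model. This is a simplified sketch that assumes 32 banks, with consecutive 32-bit words mapped to consecutive banks, and assumes that threads reading the same word do not conflict. The function name and the `num_banks` parameter are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def bank_conflict_degree(word_addresses, num_banks=32):
    """Return the worst-case number of serialized accesses for one warp.

    `word_addresses` holds one shared-memory word index per thread.
    Assumed model: word w lives in bank w % num_banks, and identical
    addresses are served by a single broadcast.
    """
    banks = defaultdict(set)
    for addr in word_addresses:
        banks[addr % num_banks].add(addr)  # distinct words per bank
    return max((len(words) for words in banks.values()), default=0)


# Stride-1 access is conflict free; stride-32 hits one bank 32 times.
stride_1 = [t for t in range(32)]
stride_32 = [32 * t for t in range(32)]
print(bank_conflict_degree(stride_1), bank_conflict_degree(stride_32))
```

Under this model a degree of 1 means the access is conflict free, while a degree of d means the hardware would need roughly d serialized passes to serve the warp.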
A Scalable VLSI Architecture for Soft-Input Soft-Output Depth-First Sphere Decoding
Multiple-input multiple-output (MIMO) wireless transmission imposes huge
challenges on the design of efficient hardware architectures for iterative
receivers. A major challenge is soft-input soft-output (SISO) MIMO demapping,
often approached by sphere decoding (SD). In this paper, we introduce what is, to
the best of our knowledge, the first VLSI architecture for SISO SD applying a single
tree-search approach. Compared with a soft-output-only base architecture
similar to the one proposed by Studer et al. in IEEE J-SAC 2008, the
architectural modifications for soft input still allow a one-node-per-cycle
execution. For a 4x4 16-QAM system, the area increases by 57% and the operating
frequency degrades by only 34%.
Comment: Accepted for IEEE Transactions on Circuits and Systems II Express
Briefs, May 2010. This draft from April 2010 will not be updated any more.
Please refer to IEEE Xplore for the final version. *) The final publication
will appear with the modified title "A Scalable VLSI Architecture for
Soft-Input Soft-Output Single Tree-Search Sphere Decoding".
A taxonomy of parallel sorting
TR 84-601
In this paper, we propose a taxonomy of parallel sorting that includes a broad range of array
and file sorting algorithms. We analyze the evolution of research on parallel sorting, from the
earliest sorting networks to the shared memory algorithms and the VLSI sorters. In the context
of sorting networks, we describe two fundamental parallel merging schemes: the odd-even and
the bitonic merge. Sorting algorithms have been derived from these merging algorithms for parallel
computers where processors communicate through interconnection networks such as the perfect
shuffle, the mesh and a number of other sparse networks. After describing the network sorting
algorithms, we show that, with a shared memory model of parallel computation, faster algorithms
have been derived from parallel enumeration sorting schemes, where keys are first ranked and
then rearranged according to their rank.
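Two of the schemes named in this abstract can be sketched sequentially in Python. The code below is a minimal illustration and is not taken from the report: `bitonic_sort` builds on a bitonic merge and assumes the input length is a power of two, while `enumeration_sort` ranks every key and then places it at its rank. In a parallel setting, the compare-exchange loop and the per-key ranking would each run concurrently; here they are plain loops. All function names are hypothetical.

```python
def bitonic_merge(seq, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    lo, hi = list(seq[:half]), list(seq[half:])
    # Compare-exchange step: all pairs (i, i + half) are independent.
    for i in range(half):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(seq, ascending=True):
    """Sort by forming a bitonic sequence and merging it."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    first = bitonic_sort(seq[:half], True)    # ascending half
    second = bitonic_sort(seq[half:], False)  # descending half
    return bitonic_merge(first + second, ascending)


def enumeration_sort(keys):
    """Rank each key, then move it to the position given by its rank."""
    out = [None] * len(keys)
    for i, x in enumerate(keys):
        # Rank = number of keys that must precede x; ties are broken
        # by original position so that every rank is unique.
        rank = sum(1 for j, y in enumerate(keys)
                   if y < x or (y == x and j < i))
        out[rank] = x
    return out
```

The enumeration sort performs a quadratic number of comparisons, but every rank can be computed independently, which is what makes the scheme attractive under a shared memory model.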
Evaluating local indirect addressing in SIMD processors
In the design of parallel computers, there exists a tradeoff between the number and power of individual processors. The single instruction stream, multiple data stream (SIMD) model of parallel computers lies at one extreme of the resulting spectrum. The available hardware resources are devoted to creating the largest possible number of processors, and consequently each individual processor must use the fewest possible resources. Disagreement exists as to whether SIMD processors should be able to generate addresses individually into their local data memory, or whether all processors should access the same address. The tradeoff between the increased capability and the reduced number of processors that occurs in this single instruction stream, multiple, locally addressed, data (SIMLAD) model is examined. Factors are assembled that affect this design choice, and the SIMLAD model is compared with the bare SIMD and the MIMD models.
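The difference between a broadcast address and a locally generated address can be sketched with a toy Python model. It is an illustration under stated assumptions, not the evaluation method of the paper: each processor is modelled as a list acting as its local memory, and one common way to emulate an indirect load on bare SIMD is assumed, namely sweeping every address while masking out processors that did not ask for it. All function names are hypothetical.

```python
def simd_load(memories, address):
    """Bare SIMD: every processor reads the same broadcast address."""
    return [mem[address] for mem in memories]


def simlad_load(memories, addresses):
    """SIMLAD: each processor reads its own locally generated address."""
    return [mem[a] for mem, a in zip(memories, addresses)]


def simd_emulated_indirect_load(memories, addresses):
    """Emulate an indirect load on bare SIMD (assumed masking scheme).

    Returns the loaded values and the number of broadcast steps used.
    """
    size = len(memories[0])
    result = [None] * len(memories)
    steps = 0
    for address in range(size):
        loaded = simd_load(memories, address)
        for pe, value in enumerate(loaded):
            # Only processors that wanted this address keep the value.
            if addresses[pe] == address:
                result[pe] = value
        steps += 1
    return result, steps


memories = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
wanted = [2, 0, 3]
print(simlad_load(memories, wanted))                  # one step
print(simd_emulated_indirect_load(memories, wanted))  # one step per address
```

In this toy model the SIMLAD load finishes in a single step, while the emulation needs one step per local memory location. That is the kind of capability gap that has to be weighed against the extra hardware per processor.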
Active data structures on GPGPUs
Active data structures support operations that may affect a large number of elements of an aggregate data structure. They are well suited for extremely fine grain parallel systems, including circuit parallelism. General purpose GPUs were designed to support regular graphics algorithms, but their intermediate level of granularity makes them potentially viable also for active data structures. We consider the characteristics of active data structures and discuss the feasibility of implementing them on GPGPUs. We describe the GPU implementations of two such data structures (ESF arrays and index intervals), assess their performance, and discuss the potential of active data structures as an unconventional programming model that can exploit the capabilities of emerging fine grain architectures such as GPUs.