592 research outputs found
Sparse Tensor Transpositions
We present a new algorithm for transposing sparse tensors called Quesadilla.
The algorithm converts the sparse tensor data structure to a list of
coordinates and sorts it with a fast multi-pass radix algorithm that exploits
knowledge of the requested transposition and the tensors input partial
coordinate ordering to provably minimize the number of parallel partial sorting
passes. We evaluate both a serial and a parallel implementation of Quesadilla
on a set of 19 tensors from the FROSTT collection, a set of tensors taken from
scientific and data analytic applications. We compare Quesadilla and a
generalization, Top-2-sadilla to several state of the art approaches, including
the tensor transposition routine used in the SPLATT tensor factorization
library. In serial tests, Quesadilla was the best strategy for 60% of all
tensor and transposition combinations and improved over SPLATT by at least 19%
in half of the combinations. In parallel tests, at least one of Quesadilla or
Top-2-sadilla was the best strategy for 52% of all tensor and transposition
combinations.Comment: This work will be the subject of a brief announcement at the 32nd ACM
Symposium on Parallelism in Algorithms and Architectures (SPAA '20
Parallel Integer Sort: Theory and Practice
Integer sorting is a fundamental problem in computer science. This paper
studies parallel integer sort both in theory and in practice. In theory, we
show tighter bounds for a class of existing practical integer sort algorithms,
which provides a solid theoretical foundation for their widespread usage in
practice and strong performance. In practice, we design a new integer sorting
algorithm, \textsf{DovetailSort}, that is theoretically-efficient and has good
practical performance.
In particular, \textsf{DovetailSort} overcomes a common challenge in existing
parallel integer sorting algorithms, which is the difficulty of detecting and
taking advantage of duplicate keys. The key insight in \textsf{DovetailSort} is
to combine algorithmic ideas from both integer- and comparison-sorting
algorithms. In our experiments, \textsf{DovetailSort} achieves competitive or
better performance than existing state-of-the-art parallel integer and
comparison sorting algorithms on various synthetic and real-world datasets
GPU-ArraySort: A parallel, in-place algorithm for sorting large number of arrays
Modern day analytics deals with big datasets from diverse fields. For many application the data is in the form of an array which consists of large number of smaller arrays. Existing techniques focus on sorting a single large array and cannot be used for sorting large number of smaller arrays in an efficient manner. Currently no such algorithm is available which can sort such large number of arrays utilizing the massively parallel architecture of GPU devices. In this paper we present a highly scalable parallel algorithm, called GPU-ArraySort, for sorting large number of arrays using a GPU. Our algorithm performs in-place operations and makes minimum use of any temporary run-time memory. Our results indicate that we can sort up to 2 million arrays having 1000 elements each, within few seconds. We compare our results with the unorthodox tagged array sorting technique based on NVIDIAs Thrust library. GPU-ArraySort out-performs the tagged array sorting technique by sorting three times more data in a much smaller time. The developed tool and strategy will be made available at https://github.com/pcdslab
GPU LSM: A Dynamic Dictionary Data Structure for the GPU
We develop a dynamic dictionary data structure for the GPU, supporting fast
insertions and deletions, based on the Log Structured Merge tree (LSM). Our
implementation on an NVIDIA K40c GPU has an average update (insertion or
deletion) rate of 225 M elements/s, 13.5x faster than merging items into a
sorted array. The GPU LSM supports the retrieval operations of lookup, count,
and range query operations with an average rate of 75 M, 32 M and 23 M
queries/s respectively. The trade-off for the dynamic updates is that the
sorted array is almost twice as fast on retrievals. We believe that our GPU LSM
is the first dynamic general-purpose dictionary data structure for the GPU.Comment: 11 pages, accepted to appear on the Proceedings of IEEE International
Parallel and Distributed Processing Symposium (IPDPS'18
Fast Sorting on a Distributed-Memory Architecture
We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms.Singapore-MIT Alliance (SMA
- …