Structured Permuting in Place on Parallel Disk Systems
The ability to perform permutations of large data sets in place reduces the amount of disk storage required. The simplest way to perform a permutation is often to read the records of a data set from a source portion of data storage, permute them in memory, and write them to a separate target portion of the same size. It can be quite expensive, however, to provide disk storage that is twice the size of very large data sets. Permuting in place reduces the expense by using only a small amount of extra disk storage beyond the size of the data set. This paper presents in-place algorithms for commonly used structured permutations. We have developed an asymptotically optimal algorithm for performing BMMC (bit-matrix-multiply/complement) permutations in place that requires at most $\frac{2N}{BD}\left( 2\left\lceil\frac{\operatorname{rank}(\gamma)}{\lg (M/B)}\right\rceil + \frac{7}{2}\right)$ parallel disk accesses, as long as the memory size is sufficiently large, where $N$ is the number of records in the data set, $M$ is the number of records that can fit in memory, $D$ is the number of disks, $B$ is the number of records in a block, and $\gamma$ is the lower left submatrix of the characteristic matrix for the permutation. This algorithm uses only a small amount of extra disk storage and requires only a constant factor more parallel disk accesses, with insignificant additional computation, compared with a previously published asymptotically optimal algorithm that keeps a full second copy of the data set on disk. We also give in-place algorithms to perform mesh and torus permutations on a multidimensional mesh. The in-place algorithm for mesh permutations requires at most $3\lceil N/BD\rceil$ parallel I/Os, and the in-place algorithm for torus permutations likewise uses only a small constant number of passes over the data. The algorithms for mesh and torus permutations require no extra disk space as long as the memory size $M$ is sufficiently large. The torus algorithm improves upon the previous best algorithm in terms of both time and space.
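At the in-memory level, the idea of trading a second full copy of the records for a small amount of bookkeeping can be illustrated by cycle-following, a standard in-place permutation technique. This is only an illustrative sketch of the general idea, not the paper's parallel-disk algorithm:

```python
def permute_in_place(data, perm):
    """Rearrange data in place so position perm[i] receives the old data[i].

    Follows permutation cycles, carrying one record at a time, so no second
    full copy of the records is needed -- the in-memory analogue of the
    disk-level saving the paper targets.
    """
    n = len(data)
    visited = [False] * n  # one flag per record, not a second copy of the data
    for start in range(n):
        if visited[start]:
            continue
        carried = data[start]
        pos = start
        while True:
            visited[pos] = True
            target = perm[pos]
            data[target], carried = carried, data[target]
            pos = target
            if pos == start:
                break

records = ["a", "b", "c", "d"]
permute_in_place(records, [2, 0, 3, 1])  # record i moves to position perm[i]
print(records)  # -> ['b', 'd', 'a', 'c']
```

The disk setting replaces the per-record flags with block-structured bookkeeping, since records must move a block at a time across parallel disks.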
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data.
The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations.
Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC, and if so avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.
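Concretely, a BMMC permutation sends a source index x, viewed as a bit vector, to A·x XOR c over GF(2), where A is the characteristic matrix and c the complement vector. The following sketch (illustrative, not the paper's implementation) applies such a mapping and recovers 3-bit bit reversal, one of the named special cases, by choosing A appropriately:

```python
N_BITS = 3  # indices 0..7

def bmmc_map(x, A, c):
    """Map source index x to target index y = A x + c over GF(2).

    A is given as a list of N_BITS rows, each row a bitmask over the
    source bits; c is the complement vector, also packed as an integer.
    """
    y = 0
    for i in range(N_BITS):
        # Bit i of y is the parity of (row i of A) AND x, XORed with c_i.
        bit = bin(A[i] & x).count("1") & 1
        y |= (bit ^ ((c >> i) & 1)) << i
    return y

# Bit reversal is BMMC: row i of A selects source bit (N_BITS - 1 - i), c = 0.
A_rev = [0b100, 0b010, 0b001]
targets = [bmmc_map(x, A_rev, 0) for x in range(8)]
print(targets)  # -> [0, 4, 2, 6, 1, 5, 3, 7]
```

Because the map is affine over GF(2), detecting BMMC-ness from a target-address vector amounts to checking that the XOR-differences of targets are consistent with a single linear map plus a constant offset.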
Distributed Triangle Counting in the Graphulo Matrix Math Library
Triangle counting is a key algorithm for large graph analysis. The Graphulo library provides a framework for implementing graph algorithms on the Apache Accumulo distributed database. In this work we adapt two algorithms for counting triangles, one that uses the adjacency matrix and another that also uses the incidence matrix, to the Graphulo library for server-side processing inside Accumulo. Cloud-based experiments show a similar performance profile for these two approaches on the family of power-law Graph500 graphs, for which data skew increasingly becomes a bottleneck. These results motivate the design of skew-aware hybrid algorithms that we propose for future work.
Comment: Honorable mention in the 2017 IEEE HPEC Graph Challenge
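The adjacency-matrix approach rests on a short identity: in an undirected simple graph, every triangle yields six closed walks of length 3, so the triangle count is trace(A^3)/6. This toy dense-matrix sketch shows the idea (illustrative only; Graphulo operates on Accumulo-backed sparse matrices server-side):

```python
def matmul(X, Y):
    """Dense matrix product over Python lists (toy scale only)."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def count_triangles(A):
    """Triangle count of an undirected simple graph: trace(A^3) / 6."""
    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(len(A))) // 6

# 4-vertex graph: one triangle (0-1-2) plus a pendant edge 2-3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
print(count_triangles(A))  # -> 1
```

The incidence-matrix variant replaces one of the multiplications with a product against the edge-vertex incidence matrix, which changes how work concentrates on high-degree vertices and hence how data skew manifests.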
Permuting and Batched Geometric Lower Bounds in the I/O Model
We study permuting and batched orthogonal geometric reporting problems in the External Memory Model (EM), assuming indivisibility of the input records.
Our main results are twofold. First, we prove a general simulation result that essentially shows that any permutation algorithm (resp. duplicate-removal algorithm) that performs α·N/B I/Os (resp. that removes a fraction of the existing duplicates) can be simulated by an algorithm that runs in α phases, where each phase reads and writes each element once but uses a block size smaller by a factor of α.
Second, we prove two lower bounds for batched rectangle stabbing and batched orthogonal range reporting queries. Assuming a short cache, we prove very high lower bounds that cannot currently be obtained with existing techniques under the tall-cache assumption.
Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two parallel algorithms are associated with two formulations of the DFT: one is based on the Kronecker product, specifically dense matrix multiplications between the input data and the Vandermonde matrix, denoted KDFT in this work; the other is based on the famous Cooley-Tukey algorithm plus phase adjustment, denoted FFT in this work. Both formulations take full advantage of the TPU's strength in matrix multiplication. The KDFT formulation additionally allows direct use of nonuniform inputs without an extra step. In both parallel algorithms, the same data-decomposition strategy is applied to the input data; through this decomposition, the dense matrix multiplications in KDFT and FFT are kept local within TPU cores and can be performed completely in parallel. Communication among TPU cores is achieved through a one-shuffle scheme in both algorithms, in which sending and receiving data take place simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters and minimizes the time required for communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow. The three-dimensional complex DFT is performed on a large example with a full TPU Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds. Scaling analysis demonstrates the high parallel efficiency of the two DFT implementations on TPUs.
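The two formulations can be sketched in miniature: a dense matrix-vector product with the DFT (Vandermonde) matrix versus a radix-2 Cooley-Tukey recursion. This single-core Python sketch is illustrative only (the paper's implementations are distributed TensorFlow programs on TPUs); it checks that the two formulations agree on a small input:

```python
import cmath

def dft_matrix(n):
    """Vandermonde matrix of the n-point DFT: F[j][k] = w^(jk), w = e^(-2πi/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def kdft(x):
    """DFT as a dense matrix-vector product (the Vandermonde/KDFT view)."""
    n = len(x)
    F = dft_matrix(n)
    return [sum(F[j][k] * x[k] for k in range(n)) for j in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return x[:]
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])

x = [1.0, 2.0, 3.0, 4.0]
a, b = kdft(x), fft(x)
print(all(abs(u - v) < 1e-9 for u, v in zip(a, b)))  # -> True
```

The matrix-product form does O(n^2) work per transform but maps directly onto the TPU's dense matrix units and places no structural requirement on the sample points, which is why it accommodates nonuniform inputs; the Cooley-Tukey form does O(n log n) work at the cost of a rigid recursive structure.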