    Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems

    This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data. The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations. Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.
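
    As a concrete illustration, here is a minimal sketch of applying a BMMC permutation to a single index, assuming n = lg N address bits; the function name, the row-wise matrix encoding, and the bit-reversal example are illustrative and not the paper's implementation.

```python
def bmmc_target(source_index: int, A: list[int], c: int) -> int:
    """Map a source index to its target index via y = A x XOR c over GF(2).

    A is given row-wise: A[i] is an integer whose bits are row i of the
    characteristic matrix, so bit i of the result is the parity of
    (A[i] AND source_index), XORed with bit i of c.
    """
    y = 0
    for i, row in enumerate(A):
        bit = bin(row & source_index).count("1") & 1  # GF(2) dot product
        y |= bit << i
    return y ^ c

# Bit reversal on 3 address bits (N = 8) is BMMC with c = 0 and A the
# reversal permutation matrix: result bit i is source bit (n - 1 - i).
A_rev = [0b100, 0b010, 0b001]
assert [bmmc_target(x, A_rev, 0) for x in range(8)] == [0, 4, 2, 6, 1, 5, 3, 7]
```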

    Structured Permuting in Place on Parallel Disk Systems

    The ability to perform permutations of large data sets in place reduces the amount of disk storage required. The simplest way to perform a permutation often is to read the records of a data set from a source portion of data storage, permute them in memory, and write them to a separate target portion of the same size. It can be quite expensive, however, to provide disk storage that is twice the size of very large data sets. Permuting in place reduces the expense by using only a small amount of extra disk storage beyond the size of the data set. This paper features in-place algorithms for commonly used structured permutations. We have developed an asymptotically optimal algorithm for performing BMMC (bit-matrix-multiply/complement) permutations in place that requires at most $\frac{2N}{BD}\left(2\left\lceil\frac{\operatorname{rank}(\gamma)}{\lg(M/B)}\right\rceil + \frac{7}{2}\right)$ parallel disk accesses, as long as $M \geq 2BD$, where $N$ is the number of records in the data set, $M$ is the number of records that can fit in memory, $D$ is the number of disks, $B$ is the number of records in a block, and $\gamma$ is the lower-left $\lg(N/B) \times \lg B$ submatrix of the characteristic matrix for the permutation. This algorithm uses $N + M$ records of disk storage and requires only a constant factor more parallel disk accesses, and insignificant additional computation, compared with a previously published asymptotically optimal algorithm that uses $2N$ records of disk storage. We also give algorithms to perform mesh and torus permutations on a $d$-dimensional mesh. The in-place algorithm for mesh permutations requires at most $3\lceil N/BD \rceil$ parallel I/Os, and the in-place algorithm for torus permutations uses at most $4dN/BD$ parallel I/Os. The algorithms for mesh and torus permutations require no extra disk space as long as the memory size $M$ is at least $3BD$. The torus algorithm improves upon the previous best algorithm in terms of both time and space.
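
    To make the bound concrete, the following sketch evaluates it for one illustrative parameter setting; the numbers are hypothetical, chosen only to satisfy $M \geq 2BD$.

```python
import math

def inplace_bmmc_io_bound(N, M, B, D, rank_gamma):
    """Upper bound on parallel disk accesses for the in-place BMMC algorithm."""
    assert M >= 2 * B * D, "the bound requires M >= 2BD"
    passes = 2 * math.ceil(rank_gamma / math.log2(M / B)) + 7 / 2
    return (2 * N / (B * D)) * passes

# Example: N = 2**20 records, M = 2**16, B = 2**10, D = 8, rank(gamma) = 10:
# 2N/BD = 256 and ceil(10 / lg 64) = 2, giving 256 * (2*2 + 3.5) = 1920 I/Os.
print(inplace_bmmc_io_bound(2**20, 2**16, 2**10, 8, 10))  # 1920.0
```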

    Optimizing the Dimensional Method for Performing Multidimensional, Multiprocessor, Out-of-Core FFTs

    We present an improved version of the Dimensional Method for computing multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system when the data consist of too many records to fit into memory. Data are spread across parallel disks and processed in sections. We use the Parallel Disk Model for analysis. The simple Dimensional Method performs the 1-dimensional FFTs for each dimension in turn. Between successive dimensions, an out-of-core permutation is used to rearrange the data to contiguous locations. The improved Dimensional Method processes multiple dimensions at a time. We show that determining an optimal sequence and grouping of dimensions is NP-complete. We then analyze the effects of two modifications to the Dimensional Method independently: processing multiple dimensions at one time, and processing single dimensions in a different order. Finally, we show a lower bound on the I/O complexity of the Dimensional Method and present an algorithm that is approximately asymptotically optimal.
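
    The following in-memory analogue sketches what the method computes, assuming the data fit in RAM; in the out-of-core setting each pass is preceded by a permutation that brings the next group of dimensions into in-core positions, and the group choices here are illustrative.

```python
import numpy as np

def dimensional_fft(data: np.ndarray, groups) -> np.ndarray:
    """Compute a multidimensional FFT one group of dimensions per pass."""
    for axes in groups:
        data = np.fft.fftn(data, axes=axes)  # FFT over this group only
    return data

x = np.random.rand(8, 8, 8)
one_at_a_time = dimensional_fft(x, [(0,), (1,), (2,)])
grouped = dimensional_fft(x, [(0, 1), (2,)])
assert np.allclose(one_at_a_time, grouped)         # groupings agree
assert np.allclose(one_at_a_time, np.fft.fftn(x))  # and match the full FFT
```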

    Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming

    We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm's I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only $\Theta(\lg^2 N)$ time for an $N$-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.
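
    A toy version of the dynamic program, assuming a simplified cost model in which cost counts passes over the full data set and each split incurs one permutation pass; BASE_LG and PASS_COST are illustrative constants, not the paper's. Note the table has only lg N states with O(lg N) candidate splits each, which is where the $\Theta(\lg^2 N)$ running time comes from.

```python
from functools import lru_cache

BASE_LG = 10     # assume 2**10-point FFTs fit in memory (illustrative)
PASS_COST = 1.0  # relative cost of the permutation between sub-FFTs

@lru_cache(maxsize=None)
def best_split_cost(lg_n: int) -> float:
    """Minimum number of passes for an N = 2**lg_n point FFT in this model."""
    if lg_n <= BASE_LG:
        return 1.0  # a single in-core pass suffices
    return min(
        best_split_cost(lg1) + best_split_cost(lg_n - lg1) + PASS_COST
        for lg1 in range(1, lg_n)
    )

print(best_split_cost(22))  # e.g. a 2**22-point FFT under the toy model
```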

    Towards a theory of cache-efficient algorithms

    We describe a model that enables us to analyze the running time of an algorithm in a computer with a memory hierarchy with limited associativity, in terms of various cache parameters. Our model, an extension of Aggarwal and Vitter's I/O model, enables us to establish useful relationships between the cache complexity and the I/O complexity of computations. As a corollary, we obtain cache-optimal algorithms for some fundamental problems like sorting, FFT, and an important subclass of permutations in the single-level cache model. We also show that ignoring associativity concerns can lead to inferior performance, by analyzing the average-case cache behavior of mergesort. We further extend our model to multiple levels of cache with limited associativity and present optimal algorithms for matrix transpose and sorting. Our techniques may be used for systematic exploitation of the memory hierarchy starting from the algorithm design stage, and for dealing with the hitherto unresolved problem of limited associativity.
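
    A toy direct-mapped cache simulator shows the kind of effect involved, assuming an idealized trace: two merge streams whose blocks collide on the same cache set miss on every access, while a co-prime number of sets avoids the conflicts. All parameters are illustrative.

```python
def count_misses(trace, num_sets: int) -> int:
    """Count misses of a direct-mapped cache on a trace of block addresses."""
    sets = [None] * num_sets
    misses = 0
    for block in trace:
        s = block % num_sets          # set index under direct mapping
        if sets[s] != block:
            misses += 1
            sets[s] = block           # evict whatever was in the set
    return misses

B_ELEMS = 8  # elements per block (illustrative)
trace = []
for i in range(512):                  # interleave two runs, merge-style
    trace.append(i // B_ELEMS)        # next element of run A
    trace.append(64 + i // B_ELEMS)   # next element of run B
print(count_misses(trace, num_sets=64))  # 1024: every access conflicts
print(count_misses(trace, num_sets=65))  # 128: one compulsory miss per block
```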

    External-Memory Graph Algorithms

    We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems. Our results include: Proximate-neighboring. We present a simple method for deriving external-memory lower bounds via reductions from a problem we call the “proximate neighbors” problem. We use this technique to derive non-trivial lower bounds for such problems as list ranking, expression tree evaluation, and connected components. PRAM simulation. We give methods for efficiently simulating PRAM computations in external memory, even for some cases in which the PRAM algorithm is not work-optimal. We apply this to derive a number of optimal (and simple) external-memory graph algorithms. Time-forward processing. We present a general technique for evaluating circuits (or “circuit-like” computations) in external memory; a sketch follows this abstract. We also use this in a deterministic list ranking algorithm. Deterministic 3-coloring of a cycle. We give several optimal methods for 3-coloring a cycle, which can be used as a subroutine for finding large independent sets for list ranking. Our ideas go beyond a straightforward PRAM simulation, and may be of independent interest. External depth-first search. We discuss a method for performing depth-first search and solving related problems efficiently in external memory. Our technique can be used in conjunction with ideas due to Ullman and Yannakakis in order to solve graph problems involving closed semi-ring computations even when their assumption that vertices fit in main memory does not hold. Our techniques apply to a number of problems, including list ranking, which we discuss in detail, finding Euler tours, expression-tree evaluation, centroid decomposition of a tree, least-common ancestors, minimum spanning tree verification, connected and biconnected components, minimum spanning forest, ear decomposition, topological sorting, reachability, graph drawing, and visibility representation.
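
    The time-forward processing technique admits a compact in-memory analogue, sketched below on the assumption that the circuit is given in topological order; in the external-memory version the priority queue is an I/O-efficient one, for which heapq stands in here.

```python
import heapq

def evaluate_circuit(nodes):
    """nodes: list of (node_id, op, successors) in topological order.

    op maps the list of incoming values to this node's value; each value is
    "sent forward in time" by queueing it under its consumer's id.
    """
    pq, results = [], {}
    for node_id, op, successors in nodes:
        incoming = []
        while pq and pq[0][0] == node_id:      # collect values sent to us
            incoming.append(heapq.heappop(pq)[1])
        results[node_id] = value = op(incoming)
        for succ in successors:
            heapq.heappush(pq, (succ, value))  # forward to later consumers
    return results

# A tiny circuit: nodes 0 and 1 are constant inputs feeding an adder, node 2.
nodes = [(0, lambda _: 2, [2]), (1, lambda _: 3, [2]), (2, sum, [])]
print(evaluate_circuit(nodes)[2])  # 5
```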

    Algorithmic ramifications of prefetching in memory hierarchy

    External-memory models, most notably the I/O model [3], capture the effects of the memory hierarchy and aid in algorithm design. More than a decade of architectural advancements have led to new features not captured in the I/O model, most notably the prefetching capability. We propose a relatively simple Prefetch model that incorporates data prefetching into the traditional I/O model and show how to design algorithms that can attain close to peak memory bandwidth. Unlike (the inverse of) memory latency, memory bandwidth is much closer to the processing speed, so intelligent use of prefetching can considerably mitigate the I/O bottleneck. For some fundamental problems, our algorithms attain running times approaching those of the idealized Random Access Machine under reasonable assumptions. Our work also explains the significantly superior performance of I/O-efficient algorithms on systems that support prefetching compared to ones that do not.
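
    A double-buffering sketch conveys the core idea (not the paper's algorithms), assuming block reads release the interpreter lock, as file I/O does in CPython: the read for block i+1 is issued before block i is processed, so the scan is paced by bandwidth rather than latency. The function names, the per-block work, and the block size are illustrative stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def process(block: bytes) -> int:
    return sum(block)  # stand-in for the real per-block computation

def prefetching_scan(path: str, block_size: int = 1 << 20) -> int:
    total = 0
    with open(path, "rb") as f, ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(f.read, block_size)      # prefetch first block
        while True:
            block = pending.result()                   # wait for the prefetch
            if not block:
                break
            pending = pool.submit(f.read, block_size)  # prefetch next block
            total += process(block)                    # overlap with the read
    return total
```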

    Vector Layout in Virtual-Memory Systems for Data-Parallel Computing

    In a data-parallel computer with virtual memory, the way in which vectors are laid out on the disk system affects the performance of data-parallel operations. We present a general method of vector layout called banded layout, in which we divide a vector into bands of a number of consecutive vector elements laid out in column-major order, and we analyze the effect of the band size on the major classes of data-parallel operations. We find that although the best band size varies among the operations, choosing fairly small band sizes (at most a track) works well in general.
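
    A sketch of the index arithmetic, assuming D disks and a band that stores `band` consecutive elements down each disk's column before moving to the next disk; the function name and parameters are illustrative, not the paper's notation.

```python
def banded_location(i: int, band: int, D: int):
    """Map vector index i to (disk, offset on that disk) under banded layout."""
    elems_per_band = band * D
    b, r = divmod(i, elems_per_band)  # which band, and rank within it
    disk, row = divmod(r, band)       # column-major: fill one column at a time
    return disk, b * band + row

# band = 4, D = 2: elements 0..3 run down disk 0, elements 4..7 down disk 1,
# then the next band begins again on disk 0 at offset 4.
for i in range(12):
    print(i, banded_location(i, band=4, D=2))
```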