464 research outputs found
Cache-Oblivious Selection in Sorted X+Y Matrices
Let X[0..n-1] and Y[0..m-1] be two sorted arrays, and define the m×n matrix A
by A[j][i] = X[i] + Y[j]. Frederickson and Johnson gave an efficient algorithm
for selecting the k-th smallest element of A. We show how to make this
algorithm I/O-efficient. Our cache-oblivious algorithm performs O((m+n)/B)
I/Os, where B is the block size of memory transfers.
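As a point of reference, selection from a sorted X+Y matrix can be sketched with a heap-based baseline. This is not the Frederickson–Johnson algorithm and makes no I/O-efficiency claims; it is a minimal O(k log k)-time in-memory sketch that exploits the sortedness of X and Y:

```python
import heapq

def kth_smallest_xy(X, Y, k):
    """Return the k-th smallest (1-indexed) element of A[j][i] = X[i] + Y[j],
    where X and Y are sorted ascending. Grows a frontier of candidate cells
    in a min-heap, popping k-1 times so the k-th smallest is at the root."""
    heap = [(X[0] + Y[0], 0, 0)]   # A[0][0] is the global minimum
    seen = {(0, 0)}
    for _ in range(k - 1):
        _, i, j = heapq.heappop(heap)
        # The next candidates dominate (i, j) in exactly one coordinate.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(X) and nj < len(Y) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (X[ni] + Y[nj], ni, nj))
    return heap[0][0]
```

The cache-oblivious algorithm in the paper instead achieves O((m+n)/B) I/Os by pruning the matrix rather than materializing a heap of candidates.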
Algorithmic ramifications of prefetching in memory hierarchy
External-memory models, most notably the I/O model [3], capture the effects of the memory hierarchy and aid algorithm design. More than a decade of architectural advances has introduced features the I/O model does not capture, most notably prefetching.
We propose a relatively simple Prefetch model that incorporates data prefetching into the traditional I/O model and show how to design
algorithms that attain close to peak memory bandwidth. Unlike (the inverse of) memory latency, memory bandwidth is much closer to
processing speed, so intelligent use of prefetching can considerably mitigate the I/O bottleneck. For some fundamental problems, our algorithms attain running times approaching those of the idealized random-access machine under reasonable assumptions. Our work also explains
the significantly superior performance of I/O-efficient algorithms on systems that support prefetching compared to those that do not.
Efficient GPU Implementation of Affine Index Permutations on Arrays
Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately, many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels to permute the elements of an array. We handle a class of permutations known as Bit Matrix Multiply Complement (BMMC) permutations, for which we design kernels of speed comparable to that of a simple array copy. This is a first step towards implementing a set of array combinators based on these permutations.
Efficient GPU implementation of a class of array permutations
Optimal usage of the memory system is a key element of fast GPU algorithms.
Unfortunately many common algorithms fail in this regard despite exhibiting
great regularity in memory access patterns. In this paper we propose efficient
kernels to permute the elements of an array, which can be used to improve the
access patterns of many algorithms. We handle a class of permutations known as
Bit Matrix Multiply Complement (BMMC) permutations, for which we design kernels
of speed comparable to that of a simple array copy. This is a first step
towards implementing a set of array combinators based on these permutations.
Comment: Submitted to ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing 202
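For illustration, a BMMC permutation maps each source index x to the target index A·x XOR c, where A is an invertible bit matrix over GF(2) and c is a complement vector. A minimal CPU-side sketch (not the GPU kernels from the papers above, which are engineered for coalesced memory access):

```python
def bmmc_index(A, c, x, nbits):
    """Map source index x to target index y = A*x XOR c over GF(2).
    A is given as a list of nbits row bitmasks; bit r of A*x is the
    parity of the AND of row r with x."""
    y = 0
    for r in range(nbits):
        y |= (bin(A[r] & x).count("1") & 1) << r
    return y ^ c

def bmmc_permute(A, c, src):
    """Apply a BMMC permutation to src (length must be a power of two),
    returning a new array with dst[A*x XOR c] = src[x]."""
    nbits = (len(src) - 1).bit_length()
    dst = [None] * len(src)
    for x, v in enumerate(src):
        dst[bmmc_index(A, c, x, nbits)] = v
    return dst
```

For example, with 2-bit indices, A = [2, 1] (rows swap the two index bits) gives a bit-reversal permutation, and the identity matrix A = [1, 2] with complement c = 3 reverses the array.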
WASP-43b: The closest-orbiting hot Jupiter
We report the discovery of WASP-43b, a hot Jupiter transiting a K7V star
every 0.81 d. At 0.6-Msun the host star has the lowest mass of any star hosting
a hot Jupiter. It also shows a 15.6-d rotation period. The planet has a mass of
1.8 Mjup, a radius of 0.9 Rjup, and with a semi-major axis of only 0.014 AU has
the smallest orbital distance of any known hot Jupiter. The discovery of such a
planet around a K7V star shows that planets with apparently short remaining
lifetimes owing to tidal decay of the orbit are also found around stars with
deep convection zones.
Comment: 4 pages
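The quoted parameters can be sanity-checked against Kepler's third law in solar units, P[yr]^2 = a[AU]^3 / M[Msun], using the rounded values from the abstract:

```python
import math

# Kepler's third law in solar units: P[years]^2 = a[AU]^3 / M[Msun]
a_au = 0.014   # semi-major axis reported in the abstract
m_sun = 0.6    # host-star mass reported in the abstract

period_days = math.sqrt(a_au**3 / m_sun) * 365.25
# ~0.78 days, consistent with the reported 0.81-day orbit
# given the rounding of a and M in the abstract
```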
Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays
Local moments are used for local regression, to compute statistical measures
such as sums, averages, and standard deviations, and to approximate probability
distributions. We consider the case where the data source is a very large I/O
array of size n and we want to compute the first N local moments, for some
constant N. Without precomputation, this requires O(n) time. We develop a
sequence of algorithms of increasing sophistication that use precomputation and
additional buffer space to speed up queries. The simpler algorithms partition
the I/O array into consecutive ranges called bins, and they are applicable not
only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM,
etc.). With N buffers of size sqrt(n), the query time drops to O(sqrt(n)). A
more sophisticated approach uses hierarchical buffering and has a logarithmic
time complexity (O(b log_b n)), when using N hierarchical buffers of size n/b.
Using Overlapped Bin Buffering, we show that only a single buffer is needed, as
with wavelet-based algorithms, but using much less storage. Applications exist
in multidimensional and statistical databases over massive data sets,
interactive image processing, and visualization.
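The simpler bin-partitioning scheme can be sketched for the SUM aggregate as follows. This is an illustrative in-memory version with bins of size roughly sqrt(n), not the paper's external-memory structure:

```python
import math

class BinBuffer:
    """Range-sum over an array using bins of size ~sqrt(n):
    O(n) precomputation, O(sqrt(n)) extra space, O(sqrt(n)) query time."""

    def __init__(self, data):
        self.data = data
        self.b = max(1, math.isqrt(len(data)))   # bin size
        self.bins = [sum(data[i:i + self.b])
                     for i in range(0, len(data), self.b)]

    def range_sum(self, lo, hi):
        """Sum of data[lo:hi]."""
        s = 0
        while lo < hi and lo % self.b:       # head: partial first bin
            s += self.data[lo]; lo += 1
        while lo + self.b <= hi:             # middle: whole bins at once
            s += self.bins[lo // self.b]; lo += self.b
        while lo < hi:                       # tail: partial last bin
            s += self.data[lo]; lo += 1
        return s
```

A point update only touches one element and one bin sum (`data[i]` and `bins[i // b]`), which is what makes the scheme attractive for dynamic arrays; the hierarchical variant in the paper trades the sqrt(n) query for a logarithmic one.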
Structured Permuting in Place on Parallel Disk Systems
The ability to perform permutations of large data sets in place reduces the amount of disk storage needed. The simplest way to perform a permutation is often to read the records of a data set from a source portion of data storage, permute them in memory, and write them to a separate target portion of the same size. It can be quite expensive, however, to provide disk storage that is twice the size of a very large data set. Permuting in place reduces the expense by using only a small amount of extra disk storage beyond the size of the data set. This paper features in-place algorithms for commonly used structured permutations. We have developed an asymptotically optimal algorithm for performing BMMC (bit-matrix-multiply/complement) permutations in place that requires at most \frac{2N}{BD}\left(2\left\lceil\frac{\operatorname{rank}(\gamma)}{\lg(M/B)}\right\rceil + \frac{7}{2}\right) parallel disk accesses, where N is the number of records in the data set, M is the number of records that can fit in memory, D is the number of disks, B is the number of records in a block, and \gamma is the lower left submatrix of the characteristic matrix for the permutation. This algorithm uses only a small amount of disk storage beyond the N records of the data set, and it requires only a constant factor more parallel disk accesses, and insignificant additional computation, compared with a previously published asymptotically optimal algorithm that uses 2N records of disk storage. We also give algorithms to perform mesh and torus permutations on a d-dimensional mesh. The in-place algorithm for mesh permutations requires at most 3\lceil N/BD \rceil parallel I/Os, and the in-place algorithm for torus permutations likewise uses a linear number of parallel I/Os. The algorithms for mesh and torus permutations require no extra disk space as long as the memory size M is sufficiently large. The torus algorithm improves upon the previous best algorithm in terms of both time and space.
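The in-memory analogue of permuting in place is cycle-following: records are rotated around each cycle of the permutation, holding only one record aside at a time. The sketch below illustrates that idea only; it is not the paper's parallel-disk algorithm, which must additionally batch record movement into blocks and stripes:

```python
def permute_in_place(data, perm):
    """Rearrange data so that data_new[perm[i]] = data_old[i].
    Follows each cycle of perm, using one carried record plus
    O(n) bits of visited marks (no second copy of the data)."""
    visited = [False] * len(data)
    for start in range(len(data)):
        if visited[start] or perm[start] == start:
            continue
        carry = data[start]          # lift one record out of the cycle
        j = perm[start]
        while j != start:            # walk the cycle, swapping as we go
            data[j], carry = carry, data[j]
            visited[j] = True
            j = perm[j]
        data[start] = carry          # close the cycle
        visited[start] = True
    return data
```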
Permuting and Batched Geometric Lower Bounds in the I/O Model
We study permuting and batched orthogonal geometric reporting problems in the External Memory Model (EM), assuming indivisibility of the input records.
Our main results are twofold. First, we prove a general simulation result: essentially, any permutation algorithm (resp. duplicate-removal algorithm) that performs alpha*N/B I/Os (resp. that removes a fraction of the existing duplicates) can be simulated by an algorithm that runs in alpha phases, each of which reads and writes every element once, but that uses a block size smaller by a factor of alpha.
Second, we prove two lower bounds, for batched rectangle stabbing and for batched orthogonal range reporting queries. Assuming a short cache, we prove very high lower bounds that are currently not attainable with existing techniques under the tall-cache assumption.