5,082 research outputs found
Run Generation Revisited: What Goes Up May or May Not Come Down
In this paper, we revisit the classic problem of run generation. Run
generation is the first phase of external-memory sorting, where the objective
is to scan through the data, reorder elements using a small buffer of size M ,
and output runs (contiguously sorted chunks of elements) that are as long as
possible.
We develop algorithms for minimizing the total number of runs (or
equivalently, maximizing the average run length) when the runs are allowed to
be sorted or reverse sorted. We study the problem in the online setting, both
with and without resource augmentation, and in the offline setting.
(1) We analyze alternating-up-down replacement selection (runs alternate
between sorted and reverse sorted), which was studied by Knuth as far back as
1963. We show that this simple policy is asymptotically optimal. Specifically,
we show that alternating-up-down replacement selection is 2-competitive and no
deterministic online algorithm can perform better.
(2) We give online algorithms having smaller competitive ratios with resource
augmentation. Specifically, we exhibit a deterministic algorithm that, when
given a buffer of size 4M , is able to match or beat any optimal algorithm
having a buffer of size M . Furthermore, we present a randomized online
algorithm which is 7/4-competitive when given a buffer twice that of the
optimal.
(3) We demonstrate that performance can also be improved with a small amount
of foresight. We give an algorithm, which is 3/2-competitive, with
foreknowledge of the next 3M elements of the input stream. For the extreme case
where all future elements are known, we design a PTAS for computing the optimal
strategy a run generation algorithm must follow.
(4) Finally, we present algorithms tailored for nearly sorted inputs which
are guaranteed to have optimal solutions with sufficiently long runs
A Bicriteria Approximation for the Reordering Buffer Problem
In the reordering buffer problem (RBP), a server is asked to process a
sequence of requests lying in a metric space. To process a request the server
must move to the corresponding point in the metric. The requests can be
processed slightly out of order; in particular, the server has a buffer of
capacity k which can store up to k requests as it reads in the sequence. The
goal is to reorder the requests in such a manner that the buffer constraint is
satisfied and the total travel cost of the server is minimized. The RBP arises
in many applications that require scheduling with a limited buffer capacity,
such as scheduling a disk arm in storage systems, switching colors in paint
shops of a car manufacturing plant, and rendering 3D images in computer
graphics.
We study the offline version of RBP and develop bicriteria approximations.
When the underlying metric is a tree, we obtain a solution of cost no more than
9OPT using a buffer of capacity 4k + 1 where OPT is the cost of an optimal
solution with buffer capacity k. Constant factor approximations were known
previously only for the uniform metric (Avigdor-Elgrabli et al., 2012). Via
randomized tree embeddings, this implies an O(log n) approximation to cost and
O(1) approximation to buffer size for general metrics. Previously the best
known algorithm for arbitrary metrics by Englert et al. (2007) provided an
O(log^2 k log n) approximation without violating the buffer constraint.Comment: 13 page
Efficient Processing of Spatial Joins Using R-Trees
Abstract: In this paper, we show that spatial joins are very suitable to be processed on a parallel hardware platform. The parallel system is equipped with a so-called shared virtual memory which is well-suited for the design and implementation of parallel spatial join algorithms. We start with an algorithm that consists of three phases: task creation, task assignment and parallel task execu-tion. In order to reduce CPU- and I/O-cost, the three phases are processed in a fashion that pre-serves spatial locality. Dynamic load balancing is achieved by splitting tasks into smaller ones and reassigning some of the smaller tasks to idle processors. In an experimental performance compar-ison, we identify the advantages and disadvantages of several variants of our algorithm. The most efficient one shows an almost optimal speed-up under the assumption that the number of disks is sufficiently large. Topics: spatial database systems, parallel database systems
Gerbil: A Fast and Memory-Efficient -mer Counter with GPU-Support
A basic task in bioinformatics is the counting of -mers in genome strings.
The -mer counting problem is to build a histogram of all substrings of
length in a given genome sequence. We present the open source -mer
counting software Gerbil that has been designed for the efficient counting of
-mers for . Given the technology trend towards long reads of
next-generation sequencers, support for large becomes increasingly
important. While existing -mer counting tools suffer from excessive memory
resource consumption or degrading performance for large , Gerbil is able to
efficiently support large without much loss of performance. Our software
implements a two-disk approach. In the first step, DNA reads are loaded from
disk and distributed to temporary files that are stored at a working disk. In a
second step, the temporary files are read again, split into -mers and
counted via a hash table approach. In addition, Gerbil can optionally use GPUs
to accelerate the counting step. For large , we outperform state-of-the-art
open source -mer counting tools for large genome data sets.Comment: A short version of this paper will appear in the proceedings of WABI
201
- âŠ