Run Generation Revisited: What Goes Up May or May Not Come Down
In this paper, we revisit the classic problem of run generation. Run
generation is the first phase of external-memory sorting, where the objective
is to scan through the data, reorder elements using a small buffer of size M,
and output runs (contiguously sorted chunks of elements) that are as long as
possible.
We develop algorithms for minimizing the total number of runs (or
equivalently, maximizing the average run length) when the runs are allowed to
be sorted or reverse sorted. We study the problem in the online setting, both
with and without resource augmentation, and in the offline setting.
(1) We analyze alternating-up-down replacement selection (runs alternate
between sorted and reverse sorted), which was studied by Knuth as far back as
1963. We show that this simple policy is asymptotically optimal. Specifically,
we show that alternating-up-down replacement selection is 2-competitive and no
deterministic online algorithm can perform better.
(2) We give online algorithms having smaller competitive ratios with resource
augmentation. Specifically, we exhibit a deterministic algorithm that, when
given a buffer of size 4M, is able to match or beat any optimal algorithm
having a buffer of size M. Furthermore, we present a randomized online
algorithm which is 7/4-competitive when given a buffer twice that of the
optimal.
(3) We demonstrate that performance can also be improved with a small amount
of foresight. We give an algorithm, which is 3/2-competitive, with
foreknowledge of the next 3M elements of the input stream. For the extreme case
where all future elements are known, we design a PTAS for computing the optimal
strategy a run generation algorithm must follow.
(4) Finally, we present algorithms tailored for nearly sorted inputs which
are guaranteed to have optimal solutions with sufficiently long runs.
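The alternating-up-down policy from item (1) can be illustrated with a minimal Python sketch. This is not the paper's code: the function name `alternating_up_down_runs`, the `pending` list used to hold elements for the next run, and the choice to start with an ascending run are all assumptions made for illustration. Numeric keys are assumed so that a descending run can reuse a min-heap by negating values.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking the end of the input stream


def alternating_up_down_runs(stream, buffer_size):
    """Generate runs that alternate between ascending and descending order.

    Illustrative sketch only: at most `buffer_size` elements are held at any
    time, split between the heap (current run) and `pending` (next run).
    """
    it = iter(stream)
    pending = list(islice(it, buffer_size))  # fill the buffer once
    runs = []
    sign = 1  # +1 for an ascending run, -1 for a descending run
    while pending:
        # Negating keys lets a min-heap serve descending runs as well.
        heap = [sign * x for x in pending]
        heapq.heapify(heap)
        pending = []
        run = []
        while heap:
            key = heapq.heappop(heap)
            run.append(sign * key)
            x = next(it, _END)
            if x is _END:
                continue  # input exhausted, keep draining the heap
            if sign * x >= key:
                heapq.heappush(heap, sign * x)  # still fits the current run
            else:
                pending.append(x)  # must wait for the next run
        runs.append(run)
        sign = -sign  # flip direction for the next run
    return runs
```

On a reverse-sorted stream such as `range(1000, 0, -1)` with `buffer_size=100`, this sketch emits one ascending run of 100 elements followed by a single descending run holding the remaining 900.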
Parallel Out-of-Core Sorting: The Third Way
Sorting very large datasets is a key subroutine in almost any application that is built on top of a large database. Two ways to sort out-of-core data dominate the literature: merging-based algorithms and partitioning-based algorithms. Within these two paradigms, all the programs that sort out-of-core data on a cluster rely on assumptions about the input distribution. We propose a third way of out-of-core sorting: oblivious algorithms. In all, we have developed six programs that sort out-of-core data on a cluster. The first three programs, based completely on Leighton's columnsort algorithm, have a restriction on the maximum problem size that they can sort. The other three programs relax this restriction; two are based on our original algorithmic extensions to columnsort. We present experimental results to show that our algorithms perform well. To the best of our knowledge, the programs presented in this thesis are the first to sort out-of-core data on a cluster without making any simplifying assumptions about the distribution of the data to be sorted.
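For readers unfamiliar with columnsort, the following in-memory Python sketch shows the eight-step structure of the basic algorithm and where the problem-size restriction comes from. It is not the thesis code: the function name, the flat column-major layout, and the in-memory column sorts are assumptions made for illustration. The shape check uses the commonly stated conditions for basic columnsort on an r-by-s matrix (s divides r, and r >= 2(s-1)^2); requiring r to be even is an extra simplification for the shift step. Keys are assumed to be numbers comparable with `math.inf`.

```python
import math


def columnsort(values, r, s):
    """Sort r*s numbers viewed as an r-by-s matrix stored column by column.

    Illustrative sketch: only the column sorts look at the data, and every
    other step is a fixed permutation, which makes the algorithm oblivious.
    """
    n = r * s
    if len(values) != n:
        raise ValueError("expected exactly r * s values")
    if r % 2 != 0 or r % s != 0 or r < 2 * (s - 1) ** 2:
        raise ValueError("need r even, s dividing r, and r >= 2 * (s - 1) ** 2")

    def sort_columns(flat):
        # Each consecutive block of r entries is one column.
        return [x for start in range(0, len(flat), r)
                for x in sorted(flat[start:start + r])]

    a = sort_columns(list(values))                      # step 1
    b = [None] * n                                      # step 2: read column-major,
    for k, x in enumerate(a):                           # write row-major
        b[(k % s) * r + k // s] = x
    b = sort_columns(b)                                 # step 3
    c = [b[(k % s) * r + k // s] for k in range(n)]     # step 4: inverse of step 2
    c = sort_columns(c)                                 # step 5
    half = r // 2
    padded = [-math.inf] * half + c + [math.inf] * (r - half)  # step 6: shift
    padded = sort_columns(padded)                       # step 7
    return padded[half:half + n]                        # step 8: unshift
```

The shape condition ties the number of columns to the column height, so for a fixed column height (for example, what fits in one node's memory) the total input size is bounded, which is the kind of restriction the abstract mentions.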
Two-way replacement selection
The performance of external sorting is highly dependent on the length of the runs generated.
One of the most commonly used run generation strategies is Replacement Selection (RS) because,
on average, it generates runs that are twice the size of the memory available.
However, the runs generated by RS are shorter for data with certain characteristics, such as inputs sorted inversely with respect to the desired output order.
The goal of this project is to propose and analyze two-way replacement selection (2WRS),
which is a generalization of RS obtained by implementing two heaps instead of the single
heap implemented by RS. Appropriate management of these two heaps allows 2WRS to generate runs larger than the memory available in a stable way, i.e. independently of the characteristics of the datasets.
Depending on the changing characteristics of the input dataset,
2WRS assigns a new data record to one or the other heap, and grows or shrinks each heap,
adapting to the increasing or decreasing tendency of the dataset.
On average, 2WRS creates runs of at least the length generated by RS,
and longer for datasets that combine increasing and decreasing data subsets.
We tested both algorithms on large datasets with different characteristics:
2WRS performs at least comparably to RS and achieves speedups over 2.5 when RS fails
to generate large runs.
The project consists of developing an external sorting algorithm based on Replacement Selection that solves the problems inherent to replacement selection.
The student will have to design and implement the algorithm, carry out a statistical study of its efficiency, and compare the time efficiency of the new algorithm with that of replacement selection.
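To make the baseline concrete, here is a minimal Python sketch of classic single-heap replacement selection (RS), the strategy that 2WRS generalizes. It is not the project's implementation, and it does not attempt 2WRS itself, whose heap-management rules are not given in the abstract. The function name `replacement_selection` and the parameter `memory_size` are assumptions made for illustration.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking the end of the input stream


def replacement_selection(stream, memory_size):
    """Split `stream` into sorted runs using a heap of `memory_size` records.

    Illustrative sketch: each heap entry is (run_id, value), so records
    reserved for the next run sort after every record of the current run.
    """
    it = iter(stream)
    heap = [(0, x) for x in islice(it, memory_size)]
    heapq.heapify(heap)
    runs = []
    run = []
    current_run = 0
    while heap:
        run_id, value = heapq.heappop(heap)
        if run_id != current_run:
            # The current run has no records left in memory: close it.
            runs.append(run)
            run = []
            current_run = run_id
        run.append(value)
        nxt = next(it, _END)
        if nxt is not _END:
            # A record smaller than the one just written cannot extend
            # the current run, so it is tagged for the next run.
            next_id = run_id if nxt >= value else run_id + 1
            heapq.heappush(heap, (next_id, nxt))
    if run:
        runs.append(run)
    return runs
```

With `memory_size=100`, a reverse-sorted input such as `range(1000, 0, -1)` yields ten runs of exactly 100 records each, which is the degenerate case the abstract describes, while an already sorted input yields a single run.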
Dynamic Memory Adjustment for External Mergesort
Sorting is a memory-intensive operation whose performance is greatly affected by the amount of memory available as work space. When the input size is unknown or available memory space varies, static memory allocation either wastes memory space or fails to make full use of memory to speed up sorting. This paper presents a method for run-time adjustment of in-memory work space for external mergesort and a policy for allocating memory among concurrent, competing sorts. Experimental results confirm that the new method enables sorts to adapt their memory usage gracefully to the actual input size and available memory space. When multiple sorts compete for memory resources, we found that sort throughput and response time are improved significantly by our policy for memory allocation combined with limiting the number of sorts processed concurrently.