12 research outputs found
On The I/O Complexity of Dynamic Distinct Counting
In dynamic distinct counting, we want to maintain a multi-set S of integers under insertions to answer efficiently the query: how many distinct elements are there in S? In external memory, the problem admits two standard solutions. The first one maintains in a hash structure, so that the distinct count can be incrementally updated after each insertion using O(1) expected I/Os. A query is answered for free. The second one stores S in a linked list, and thus supports an insertion in O(1/B) amortized I/Os. A query can be answered in O(N/B log_{M/B} (N/B)) I/Os by sorting, where N=|S|, B is the block size, and M is the memory size.
In this paper, we show that the above two naive solutions are already optimal within a polylog factor. Specifically, for any Las Vegas structure using N^{O(1)} blocks, if its expected amortized insertion cost is o(1/log B}), then it must incur Omega(N/(B log B)) expected I/Os answering a query in the worst case, under the (realistic) condition that N is a polynomial of B. This means that the problem is repugnant to update buffering: the query cost jumps from 0 dramatically to almost linearity as soon as the insertion cost drops slightly below Omega(1)
Permuting and Batched Geometric Lower Bounds in the I/O Model
We study permuting and batched orthogonal geometric reporting problems in the External Memory Model (EM), assuming indivisibility of the input records.
Our main results are twofold. First, we prove a general simulation result that essentially shows that any permutation algorithm (resp. duplicate removal algorithm) that does alpha*N/B I/Os (resp. to remove a fraction of the existing duplicates) can be simulated with an algorithm that does alpha phases where each phase reads and writes each element once, but using a factor alpha smaller block size.
Second, we prove two lower bounds for batched rectangle stabbing and batched orthogonal range reporting queries. Assuming a short cache, we prove very high lower bounds that currently are not possible with the existing techniques under the tall cache assumption
I/O-Efficient Planar Range Skyline and Attrition Priority Queues
In the planar range skyline reporting problem, we store a set P of n 2D
points in a structure such that, given a query rectangle Q = [a_1, a_2] x [b_1,
b_2], the maxima (a.k.a. skyline) of P \cap Q can be reported efficiently. The
query is 3-sided if an edge of Q is grounded, giving rise to two variants:
top-open (b_2 = \infty) and left-open (a_1 = -\infty) queries.
All our results are in external memory under the O(n/B) space budget, for
both the static and dynamic settings:
* For static P, we give structures that answer top-open queries in O(log_B n
+ k/B), O(loglog_B U + k/B), and O(1 + k/B) I/Os when the universe is R^2, a U
x U grid, and a rank space grid [O(n)]^2, respectively (where k is the number
of reported points). The query complexity is optimal in all cases.
* We show that the left-open case is harder, such that any linear-size
structure must incur \Omega((n/B)^e + k/B) I/Os for a query. We show that this
case is as difficult as the general 4-sided queries, for which we give a static
structure with the optimal query cost O((n/B)^e + k/B).
* We give a dynamic structure that supports top-open queries in O(log_2B^e
(n/B) + k/B^1-e) I/Os, and updates in O(log_2B^e (n/B)) I/Os, for any e
satisfying 0 \le e \le 1. This leads to a dynamic structure for 4-sided queries
with optimal query cost O((n/B)^e + k/B), and amortized update cost O(log
(n/B)).
As a contribution of independent interest, we propose an I/O-efficient
version of the fundamental structure priority queue with attrition (PQA). Our
PQA supports FindMin, DeleteMin, and InsertAndAttrite all in O(1) worst case
I/Os, and O(1/B) amortized I/Os per operation.
We also add the new CatenateAndAttrite operation that catenates two PQAs in
O(1) worst case and O(1/B) amortized I/Os. This operation is a non-trivial
extension to the classic PQA of Sundar, even in internal memory.Comment: Appeared at PODS 2013, New York, 19 pages, 10 figures. arXiv admin
note: text overlap with arXiv:1208.4511, arXiv:1207.234
Efficient bulk-loading methods for temporal and multidimensional index structures
Nahezu alle naturwissenschaftlichen Bereiche profitieren von neuesten Analyse- und Verarbeitungsmethoden fĂĽr groĂźe Datenmengen. Diese Verfahren setzten eine effiziente Verarbeitung von geo- und zeitbezogenen Daten voraus, da die Zeit und die Position wichtige Attribute vieler Daten
sind. Die effiziente Anfrageverarbeitung wird insbesondere durch den Einsatz von Indexstrukturen
ermöglicht. Im Fokus dieser Arbeit liegen zwei Indexstrukturen: Multiversion B-Baum
(MVBT) und R-Baum. Die erste Struktur wird fĂĽr die Verwaltung von zeitbehafteten Daten,
die zweite fĂĽr die Indexierung von mehrdimensionalen Rechteckdaten eingesetzt.
Ständig- und schnellwachsendes Datenvolumen stellt eine große Herausforderung an die Informatik
dar. Der Aufbau und das Aktualisieren von Indexen mit herkömmlichen Methoden (Datensatz
fĂĽr Datensatz) ist nicht mehr effizient. Um zeitnahe und kosteneffiziente Datenverarbeitung
zu ermöglichen, werden Verfahren zum schnellen Laden von Indexstrukturen dringend benötigt.
Im ersten Teil der Arbeit widmen wir uns der Frage, ob es ein Verfahren fĂĽr das Laden von MVBT
existiert, das die gleiche I/O-Komplexität wie das externe Sortieren besitz. Bis jetzt blieb diese
Frage unbeantwortet. In dieser Arbeit haben wir eine neue Kostruktionsmethode entwickelt und
haben gezeigt, dass diese gleiche Zeitkomplexität wie das externe Sortieren besitzt. Dabei haben
wir zwei algorithmische Techniken eingesetzt: Gewichts-Balancierung und Puffer-Bäume. Unsere
Experimenten zeigen, dass das Resultat nicht nur theoretischer Bedeutung ist.
Im zweiten Teil der Arbeit beschäftigen wir uns mit der Frage, ob und wie statistische Informationen
über Geo-Anfragen ausgenutzt werden können, um die Anfrageperformanz von R-Bäumen zu
verbessern. Unsere neue Methode verwendet Informationen wie Seitenverhältnis und Seitenlängen
eines repräsentativen Anfragerechtecks, um einen guten R-Baum bezüglich eines häufig eingesetzten
Kostenmodells aufzubauen. Falls diese Informationen nicht verfĂĽgbar sind, optimieren
wir R-Bäume bezüglich der Summe der Volumina von minimal umgebenden Rechtecken der Blattknoten.
Da das Problem des Aufbaus von optimalen R-Bäumen bezüglich dieses Kostenmaßes
NP-hart ist, führen wir zunächst das Problem auf ein eindimensionales Partitionierungsproblem
zurück, indem wir die Daten bezüglich optimierte raumfüllende Kurven sortieren. Dann lösen
wir dieses Problem durch Einsatz vom dynamischen Programmieren. Die I/O-Komplexität des
Verfahrens ist gleich der von externem Sortieren, da die I/O-Laufzeit der Methode durch die
Laufzeit des Sortierens dominiert wird.
Im letzten Teil der Arbeit haben wir die entwickelten Partitionierungsvefahren fĂĽr den Aufbau
von Geo-Histogrammen eingesetzt, da diese ähnlich zu R-Bäumen eine disjunkte Partitionierung
des Raums erzeugen. Ergebnisse von intensiven Experimenten zeigen, dass sich unter Verwendung
von neuen Partitionierungstechniken sowohl R-Bäume mit besserer Anfrageperformanz als
auch Geo-Histogrammen mit besserer Schätzqualität im Vergleich zu Konkurrenzverfahren generieren
lassen
Efficient Reorganisation of Hybrid Index Structures Supporting Multimedia Search Criteria
This thesis describes the development and setup of hybrid index structures. They are access methods for retrieval techniques in hybrid data spaces which are formed by one or more relational or normalised columns in conjunction with one non-relational or non-normalised column. Examples for these hybrid data spaces are, among others, textual data combined with geographical ones or data from enterprise content management systems. However, all non-relational data types may be stored as well as image feature vectors or comparable types.
Hybrid index structures are known to function efficiently regarding retrieval operations. Unfortunately, little information is available about reorganisation operations which insert or update the row tuples. The fundamental research is mainly executed in simulation based environments. This work is written ensuing from a previous thesis that implements hybrid access structures in realistic database surroundings. During this implementation it has become obvious that retrieval works efficiently. Yet, the restructuring approaches require too much effort to be set up, e.g., in web search engine environments where several thousands of documents are inserted or modified every day. These search engines rely on relational database systems as storage backends. Hence, the setup of these access methods for hybrid data spaces is required in real world database management systems.
This thesis tries to apply a systematic approach for the optimisation of the rearrangement algorithms inside realistic scenarios. Thus, a measurement and evaluation scheme is created which is repeatedly deployed to an evolving state and a model of hybrid index structures in order to optimise the regrouping algorithms to make a setup of hybrid index structures in real world information systems possible. Thus, a set of input corpora is selected which is applied to the test suite as well as an evaluation scheme.
To sum up, it can be said that this thesis describes input sets, a test suite including an evaluation scheme as well as optimisation iterations on reorganisation algorithms reflecting a theoretical model framework to provide efficient reorganisations of hybrid index structures supporting multimedia search criteria