Search CORE

13 research outputs found

B-tree indexes for high update rates

Author: Graefe Goetz
Publication venue: Dagstuhl Seminar Proceedings. 05421 - Data Always and Everywhere - Management of Mobile, Ubiquitous, Pervasive, and Sensor Data
Publication date: 01/01/2006
Field of study

In some applications, data capture dominates query processing. For example, monitoring moving objects often requires more insertions and updates than queries. Data gathering using automated sensors often exhibits this imbalance. More generally, indexing streams apparently is considered an unsolved problem. For those applications, B-tree indexes are reasonable choices if some trade-off decisions are tilted towards optimization of updates rather than of queries. This paper surveys techniques that let B-trees sustain very high update rates, up to multiple orders of magnitude higher than tradi-tional B-trees, at the expense of query processing performance. Perhaps not surprisingly, some of these techniques are reminiscent of those employed during index creation, index rebuild, etc., while others are derived from other well known technologies such as differential files and log-structured file systems

Dagstuhl Research Online Publication Server

Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes

Author: Pagh Rasmus
Rao S. Srinivasa
Publication venue
Publication date: 18/11/2008
Field of study

Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string over S. A "secondary index" for x answers alphabet range queries of the form: Given a range [a_l,a_r] over S, return the set I_{[a_l;a_r]} = {i |x_i \in [a_l; a_r]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well-known that the obvious solution, storing a dictionary for the position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent I_{[a_l;a_r]}, assuming that the size of internal memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space usage of the data structure is O(n log |S|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0-th order entropy of x. We show how to support updates achieving various time-space trade-offs. We also consider an approximate version of the basic secondary indexing problem where a query reports a superset of I_{[a_l;a_r]} containing each element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}| log(1/epsilon)) bits.Comment: 16 page

arXiv.org e-Print Archive

The IT University of Copenhagen's Repository

Template B+ trees: an index scheme for fast data streams with distributed append-only stores

Author: Mazumdar Parijat
Publication venue
Publication date: 01/05/2016
Field of study

Distributed systems are now commonly used to manage massive data flooding from the physical world, such as user-generated content from online social media and communication records from mobile phones. The new generation of distributed data management systems, such as HBase, Cassandra and Riak, are designed to perform queries and tuple insertions only. Other database operations such as deletions and updates are simulated by appending the keys associated with the target tuples to operation logs. Such an append-only store architecture maximizes the processing throughput on incoming data, but potentially incurs higher costs during query processing, because additional computation is needed to generate consistent snapshots of the database. Indexing is the key to enable efficient query processing by fast data retrieval and aggregation under such a system architecture. This thesis presents a new in-memory indexing scheme for distributed append-only stores. Our new scheme utilizes traditional index structures based on B+ trees and their variants to create an efficient in-memory template-based tree without the overhead of expensive node splits. We also propose the use of optimized domain partitioning and multi-thread insertion techniques to exploit the advantages of the template B+ tree structure. Our empirical evaluations show that insertion throughput is five times higher with template B+ trees than with HBase, on a variety of real and synthetic workloads

Illinois Digital Environment for Access to Learning and Scholarship Repository

Indexing Moving Objects Using Short-Lived Throwaway Indexes

Author: B. Cui
D. Hilbert
D.E. Knuth
D.G. Severance
D.V. Kalashnikov
G. Kollios
H. Tropf
J.A. Orenstein
L. Arge
M. Pelanis
P. Muth
T. Brinkhoff
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

Crossref

OCTOPUS: Efficient Query Execution on Dynamic Mesh Datasets

Author: Ailamaki Anastasia
Heinis Thomas
Markram Henry
Schürmann Felix
Tauheed Farhan
Publication venue
Publication date: 02/12/2013
Field of study

Scientists in many disciplines use spatial mesh models to study physical phenomena. Simulating natural phenomena by changing meshes over time helps to understand and predict future behavior of the phenomena. The higher the precision of the mesh models, the more insight do the scientists gain and they thus continuously increase the detail of the meshes and build them as detailed as their instruments and the simulation hardware allow. In the process, the data volume also increases, slowing down the execution of spatial range queries needed to monitor the simulation considerably. Indexing speeds up range query execution, but the overhead to maintain the indexes is considerable because almost the entire mesh changes unpredictably at every simulation step. Using a simple linear scan, on the other hand, requires accessing the entire mesh and the performance deteriorates as the size of the dataset grows. In this paper we propose OCTOPUS, a strategy for executing range queries on mesh datasets that change unpredictably during simulations. In OCTOPUS we use the key insight that the mesh surface along with the mesh connectivity is sufficient to retrieve accurate query results efficiently. With this novel query execution strategy, OCTOPUS minimizes index maintenance cost and reduces query execution time considerably. Our experiments show that OCTOPUS achieves a speedup between 7.2x and 9.2x compared to the state of the art and that it scales better with increasing mesh dataset size and detail

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Efficient bulk-loading methods for temporal and multidimensional index structures

Author: Achakeev Daniar
Publication venue: Philipps-Universität Marburg
Publication date: 01/01/2013
Field of study

Nahezu alle naturwissenschaftlichen Bereiche profitieren von neuesten Analyse- und Verarbeitungsmethoden für große Datenmengen. Diese Verfahren setzten eine effiziente Verarbeitung von geo- und zeitbezogenen Daten voraus, da die Zeit und die Position wichtige Attribute vieler Daten sind. Die effiziente Anfrageverarbeitung wird insbesondere durch den Einsatz von Indexstrukturen ermöglicht. Im Fokus dieser Arbeit liegen zwei Indexstrukturen: Multiversion B-Baum (MVBT) und R-Baum. Die erste Struktur wird für die Verwaltung von zeitbehafteten Daten, die zweite für die Indexierung von mehrdimensionalen Rechteckdaten eingesetzt. Ständig- und schnellwachsendes Datenvolumen stellt eine große Herausforderung an die Informatik dar. Der Aufbau und das Aktualisieren von Indexen mit herkömmlichen Methoden (Datensatz für Datensatz) ist nicht mehr effizient. Um zeitnahe und kosteneffiziente Datenverarbeitung zu ermöglichen, werden Verfahren zum schnellen Laden von Indexstrukturen dringend benötigt. Im ersten Teil der Arbeit widmen wir uns der Frage, ob es ein Verfahren für das Laden von MVBT existiert, das die gleiche I/O-Komplexität wie das externe Sortieren besitz. Bis jetzt blieb diese Frage unbeantwortet. In dieser Arbeit haben wir eine neue Kostruktionsmethode entwickelt und haben gezeigt, dass diese gleiche Zeitkomplexität wie das externe Sortieren besitzt. Dabei haben wir zwei algorithmische Techniken eingesetzt: Gewichts-Balancierung und Puffer-Bäume. Unsere Experimenten zeigen, dass das Resultat nicht nur theoretischer Bedeutung ist. Im zweiten Teil der Arbeit beschäftigen wir uns mit der Frage, ob und wie statistische Informationen über Geo-Anfragen ausgenutzt werden können, um die Anfrageperformanz von R-Bäumen zu verbessern. Unsere neue Methode verwendet Informationen wie Seitenverhältnis und Seitenlängen eines repräsentativen Anfragerechtecks, um einen guten R-Baum bezüglich eines häufig eingesetzten Kostenmodells aufzubauen. Falls diese Informationen nicht verfügbar sind, optimieren wir R-Bäume bezüglich der Summe der Volumina von minimal umgebenden Rechtecken der Blattknoten. Da das Problem des Aufbaus von optimalen R-Bäumen bezüglich dieses Kostenmaßes NP-hart ist, führen wir zunächst das Problem auf ein eindimensionales Partitionierungsproblem zurück, indem wir die Daten bezüglich optimierte raumfüllende Kurven sortieren. Dann lösen wir dieses Problem durch Einsatz vom dynamischen Programmieren. Die I/O-Komplexität des Verfahrens ist gleich der von externem Sortieren, da die I/O-Laufzeit der Methode durch die Laufzeit des Sortierens dominiert wird. Im letzten Teil der Arbeit haben wir die entwickelten Partitionierungsvefahren für den Aufbau von Geo-Histogrammen eingesetzt, da diese ähnlich zu R-Bäumen eine disjunkte Partitionierung des Raums erzeugen. Ergebnisse von intensiven Experimenten zeigen, dass sich unter Verwendung von neuen Partitionierungstechniken sowohl R-Bäume mit besserer Anfrageperformanz als auch Geo-Histogrammen mit besserer Schätzqualität im Vergleich zu Konkurrenzverfahren generieren lassen

Publikations- und Dokumentenserver der Universitätsbibliothek Marburg

Tietuekimppujen indeksointi flash-muistilla

Author: Lehtonen Jonas
Publication venue
Publication date: 02/06/2014
Field of study

In database applications, bulk operations which affect multiple records at once are common. They are performed when operations on single records at a time are not efficient enough. They can occur in several ways, both by applications naturally having bulk operations (such as a sales database which updates daily) and by applications performing them routinely as part of some other operation. While bulk operations have been studied for decades, their use with flash memory has been studied less. Flash memory, an increasingly popular alternative/complement to magnetic hard disks, has far better seek times, low power consumption and other desirable characteristics for database applications. However, erasing data is a costly operation, which means that designing index structures specifically for flash disks is useful. This thesis will investigate flash memory on data structures in general, identifying some common design traits, and incorporate those traits into a novel index structure, the bulk index. The bulk index is an index structure for bulk operations on flash memory, and was experimentally compared to a flash-based index structure that has shown impressive results, the Lazy Adaptive Tree (LA-tree for short). The bulk insertion experiments were made with varying-sized elementary bulks, i.e. maximal sets of inserted keys that fall between two consecutive keys in the existing data. The bulk index consistently performed better than the LA-tree, and especially well on bulk insertion experiments with many very small or a few very large elementary bulks, or with large inserted bulks. It was more than 4 times as fast at best. On range searches, it performed up to 50 % faster than the LA-tree, performing better on large ranges. Range deletions were also shown to be constant-time on the bulk index.Tietokantasovelluksissa kimppuoperaatiot jotka vaikuttavat useampaan alkioon kerralla ovat yleisiä, ja niitä käytetään tehostamaan tietokannan toimintaa. Niitä voi käyttää kun data lisätään tietokantaan suuressa erässä (esimerkiksi myyntidata jota päivitetään kerran päivässä)tai osana muita tietokantaoperaatioita. Kimppuoperaatioita on tutkittu jo vuosikymmeniä, mutta niiden käyttöä flash-muistilla on tutkittu vähemmän. Flash-muisti on yleistyvä muistiteknologiajota käytetään magneettisten kiintolevyjen sijaan tai niiden rinnalla. Sen tietokannoille hyödyllisiin ominaisuuksiin kuuluvat mm. nopeat hakuajat ja alhainen sähkönkulutus. Kuitenkin datan poisto levyltä on työläs operaatio flash-levyillä, mistä johtuen tietorakenteet kannattaa suunnitella erikseen flash-levyille. Tämä työ tutkii flashin käyttöä tietorakenteissa ja koostaa niistä flashille soveltuvia suunnitteluperiaatteita. Näitä periaatteita edustaa myös työssä esitetty uusi rakenne, kimppuhakemisto (bulk index). Kimppuhakemisto on tietorakenne kimppuoperaatioille flash-muistilla, ja sitä verrataan kokeellisesti LA-puuhun (Lazy Adaptive Tree, suom. laiska adaptiivinen puu), joka on suoriutunut hyvin kokeissa flash-muistilla. Kokeissa käytettiin vaihtelevan kokoisia alkeiskimppuja, eli maksimaalisia joukkoja lisätyssä datassa jotka sijoittuvat kahden olemassaolevan avaimen väliin. Kimppuhakemisto oli nopeampi kuin LA-puu, ja erityisen paljon nopeampi kimppulisäyksissä pienellä määrällä hyvin suuria tai suurella määrällä hyvin pieniä alkeiskimppuja, tai suurilla kimppulisäyksillä. Parhaimmillaan se oli yli neljä kertaa nopeampi. Välihauissa se oli jopa 50 % nopeampi kuin LA-puu, ja parempi suurten välien kanssa. Välipoistot näytettiin vakioaikaisiksi kimppuhakemistossa

Aaltodoc Publication Archive

Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing

Author: Hoßbach Bastian
Publication venue: Philipps-Universität Marburg
Publication date: 01/01/2015
Field of study

In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing amount of devices and their ubiquitous inter-connection also known as the Internet of Things (IoT). Academia, industry and the open source community have developed several event processing (EP) systems that allow users to define, manage and execute continuous queries over event streams. They achieve a significantly better performance than the traditional store-then-process'' approach in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape became highly heterogenous. Today's EP systems differ in APIs, execution behaviors and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We show therefore how the middleware can connect to not only different EP systems, but also database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable self-adaptivity which is needed in context-sensitive applications and those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to be able to run distributed in the form of a federation of heterogenous EP systems

Publikations- und Dokumentenserver der Universitätsbibliothek Marburg

WRITE-INTENSIVE DATA MANAGEMENT IN LOG-STRUCTURED STORAGE

Author: WANG SHENG
Publication venue
Publication date: 19/04/2016
Field of study

Ph.DDOCTOR OF PHILOSOPH

ScholarBank@NUS