13 research outputs found

    B-tree indexes for high update rates

    Get PDF
    In some applications, data capture dominates query processing. For example, monitoring moving objects often requires more insertions and updates than queries. Data gathering using automated sensors often exhibits this imbalance. More generally, indexing streams apparently is considered an unsolved problem. For those applications, B-tree indexes are reasonable choices if some trade-off decisions are tilted towards optimization of updates rather than of queries. This paper surveys techniques that let B-trees sustain very high update rates, up to multiple orders of magnitude higher than tradi-tional B-trees, at the expense of query processing performance. Perhaps not surprisingly, some of these techniques are reminiscent of those employed during index creation, index rebuild, etc., while others are derived from other well known technologies such as differential files and log-structured file systems

    Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes

    Full text link
    Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string over S. A "secondary index" for x answers alphabet range queries of the form: Given a range [a_l,a_r] over S, return the set I_{[a_l;a_r]} = {i |x_i \in [a_l; a_r]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well-known that the obvious solution, storing a dictionary for the position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent I_{[a_l;a_r]}, assuming that the size of internal memory is (|S| log n)^{delta} blocks, for some constant delta > 0. The space usage of the data structure is O(n log |S|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0-th order entropy of x. We show how to support updates achieving various time-space trade-offs. We also consider an approximate version of the basic secondary indexing problem where a query reports a superset of I_{[a_l;a_r]} containing each element not in I_{[a_l;a_r]} with probability at most epsilon, where epsilon > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I_{[a_l;a_r]}| log(1/epsilon)) bits.Comment: 16 page

    Template B+ trees: an index scheme for fast data streams with distributed append-only stores

    Get PDF
    Distributed systems are now commonly used to manage massive data flooding from the physical world, such as user-generated content from online social media and communication records from mobile phones. The new generation of distributed data management systems, such as HBase, Cassandra and Riak, are designed to perform queries and tuple insertions only. Other database operations such as deletions and updates are simulated by appending the keys associated with the target tuples to operation logs. Such an append-only store architecture maximizes the processing throughput on incoming data, but potentially incurs higher costs during query processing, because additional computation is needed to generate consistent snapshots of the database. Indexing is the key to enable efficient query processing by fast data retrieval and aggregation under such a system architecture. This thesis presents a new in-memory indexing scheme for distributed append-only stores. Our new scheme utilizes traditional index structures based on B+ trees and their variants to create an efficient in-memory template-based tree without the overhead of expensive node splits. We also propose the use of optimized domain partitioning and multi-thread insertion techniques to exploit the advantages of the template B+ tree structure. Our empirical evaluations show that insertion throughput is five times higher with template B+ trees than with HBase, on a variety of real and synthetic workloads

    OCTOPUS: Efficient Query Execution on Dynamic Mesh Datasets

    Get PDF
    Scientists in many disciplines use spatial mesh models to study physical phenomena. Simulating natural phenomena by changing meshes over time helps to understand and predict future behavior of the phenomena. The higher the precision of the mesh models, the more insight do the scientists gain and they thus continuously increase the detail of the meshes and build them as detailed as their instruments and the simulation hardware allow. In the process, the data volume also increases, slowing down the execution of spatial range queries needed to monitor the simulation considerably. Indexing speeds up range query execution, but the overhead to maintain the indexes is considerable because almost the entire mesh changes unpredictably at every simulation step. Using a simple linear scan, on the other hand, requires accessing the entire mesh and the performance deteriorates as the size of the dataset grows. In this paper we propose OCTOPUS, a strategy for executing range queries on mesh datasets that change unpredictably during simulations. In OCTOPUS we use the key insight that the mesh surface along with the mesh connectivity is sufficient to retrieve accurate query results efficiently. With this novel query execution strategy, OCTOPUS minimizes index maintenance cost and reduces query execution time considerably. Our experiments show that OCTOPUS achieves a speedup between 7.2x and 9.2x compared to the state of the art and that it scales better with increasing mesh dataset size and detail

    Efficient bulk-loading methods for temporal and multidimensional index structures

    Get PDF
    Nahezu alle naturwissenschaftlichen Bereiche profitieren von neuesten Analyse- und Verarbeitungsmethoden fĂŒr große Datenmengen. Diese Verfahren setzten eine effiziente Verarbeitung von geo- und zeitbezogenen Daten voraus, da die Zeit und die Position wichtige Attribute vieler Daten sind. Die effiziente Anfrageverarbeitung wird insbesondere durch den Einsatz von Indexstrukturen ermöglicht. Im Fokus dieser Arbeit liegen zwei Indexstrukturen: Multiversion B-Baum (MVBT) und R-Baum. Die erste Struktur wird fĂŒr die Verwaltung von zeitbehafteten Daten, die zweite fĂŒr die Indexierung von mehrdimensionalen Rechteckdaten eingesetzt. StĂ€ndig- und schnellwachsendes Datenvolumen stellt eine große Herausforderung an die Informatik dar. Der Aufbau und das Aktualisieren von Indexen mit herkömmlichen Methoden (Datensatz fĂŒr Datensatz) ist nicht mehr effizient. Um zeitnahe und kosteneffiziente Datenverarbeitung zu ermöglichen, werden Verfahren zum schnellen Laden von Indexstrukturen dringend benötigt. Im ersten Teil der Arbeit widmen wir uns der Frage, ob es ein Verfahren fĂŒr das Laden von MVBT existiert, das die gleiche I/O-KomplexitĂ€t wie das externe Sortieren besitz. Bis jetzt blieb diese Frage unbeantwortet. In dieser Arbeit haben wir eine neue Kostruktionsmethode entwickelt und haben gezeigt, dass diese gleiche ZeitkomplexitĂ€t wie das externe Sortieren besitzt. Dabei haben wir zwei algorithmische Techniken eingesetzt: Gewichts-Balancierung und Puffer-BĂ€ume. Unsere Experimenten zeigen, dass das Resultat nicht nur theoretischer Bedeutung ist. Im zweiten Teil der Arbeit beschĂ€ftigen wir uns mit der Frage, ob und wie statistische Informationen ĂŒber Geo-Anfragen ausgenutzt werden können, um die Anfrageperformanz von R-BĂ€umen zu verbessern. Unsere neue Methode verwendet Informationen wie SeitenverhĂ€ltnis und SeitenlĂ€ngen eines reprĂ€sentativen Anfragerechtecks, um einen guten R-Baum bezĂŒglich eines hĂ€ufig eingesetzten Kostenmodells aufzubauen. Falls diese Informationen nicht verfĂŒgbar sind, optimieren wir R-BĂ€ume bezĂŒglich der Summe der Volumina von minimal umgebenden Rechtecken der Blattknoten. Da das Problem des Aufbaus von optimalen R-BĂ€umen bezĂŒglich dieses Kostenmaßes NP-hart ist, fĂŒhren wir zunĂ€chst das Problem auf ein eindimensionales Partitionierungsproblem zurĂŒck, indem wir die Daten bezĂŒglich optimierte raumfĂŒllende Kurven sortieren. Dann lösen wir dieses Problem durch Einsatz vom dynamischen Programmieren. Die I/O-KomplexitĂ€t des Verfahrens ist gleich der von externem Sortieren, da die I/O-Laufzeit der Methode durch die Laufzeit des Sortierens dominiert wird. Im letzten Teil der Arbeit haben wir die entwickelten Partitionierungsvefahren fĂŒr den Aufbau von Geo-Histogrammen eingesetzt, da diese Ă€hnlich zu R-BĂ€umen eine disjunkte Partitionierung des Raums erzeugen. Ergebnisse von intensiven Experimenten zeigen, dass sich unter Verwendung von neuen Partitionierungstechniken sowohl R-BĂ€ume mit besserer Anfrageperformanz als auch Geo-Histogrammen mit besserer SchĂ€tzqualitĂ€t im Vergleich zu Konkurrenzverfahren generieren lassen

    Tietuekimppujen indeksointi flash-muistilla

    Get PDF
    In database applications, bulk operations which affect multiple records at once are common. They are performed when operations on single records at a time are not efficient enough. They can occur in several ways, both by applications naturally having bulk operations (such as a sales database which updates daily) and by applications performing them routinely as part of some other operation. While bulk operations have been studied for decades, their use with flash memory has been studied less. Flash memory, an increasingly popular alternative/complement to magnetic hard disks, has far better seek times, low power consumption and other desirable characteristics for database applications. However, erasing data is a costly operation, which means that designing index structures specifically for flash disks is useful. This thesis will investigate flash memory on data structures in general, identifying some common design traits, and incorporate those traits into a novel index structure, the bulk index. The bulk index is an index structure for bulk operations on flash memory, and was experimentally compared to a flash-based index structure that has shown impressive results, the Lazy Adaptive Tree (LA-tree for short). The bulk insertion experiments were made with varying-sized elementary bulks, i.e. maximal sets of inserted keys that fall between two consecutive keys in the existing data. The bulk index consistently performed better than the LA-tree, and especially well on bulk insertion experiments with many very small or a few very large elementary bulks, or with large inserted bulks. It was more than 4 times as fast at best. On range searches, it performed up to 50 % faster than the LA-tree, performing better on large ranges. Range deletions were also shown to be constant-time on the bulk index.Tietokantasovelluksissa kimppuoperaatiot jotka vaikuttavat useampaan alkioon kerralla ovat yleisiÀ, ja niitÀ kÀytetÀÀn tehostamaan tietokannan toimintaa. NiitÀ voi kÀyttÀÀ kun data lisÀtÀÀn tietokantaan suuressa erÀssÀ (esimerkiksi myyntidata jota pÀivitetÀÀn kerran pÀivÀssÀ)tai osana muita tietokantaoperaatioita. Kimppuoperaatioita on tutkittu jo vuosikymmeniÀ, mutta niiden kÀyttöÀ flash-muistilla on tutkittu vÀhemmÀn. Flash-muisti on yleistyvÀ muistiteknologiajota kÀytetÀÀn magneettisten kiintolevyjen sijaan tai niiden rinnalla. Sen tietokannoille hyödyllisiin ominaisuuksiin kuuluvat mm. nopeat hakuajat ja alhainen sÀhkönkulutus. Kuitenkin datan poisto levyltÀ on työlÀs operaatio flash-levyillÀ, mistÀ johtuen tietorakenteet kannattaa suunnitella erikseen flash-levyille. TÀmÀ työ tutkii flashin kÀyttöÀ tietorakenteissa ja koostaa niistÀ flashille soveltuvia suunnitteluperiaatteita. NÀitÀ periaatteita edustaa myös työssÀ esitetty uusi rakenne, kimppuhakemisto (bulk index). Kimppuhakemisto on tietorakenne kimppuoperaatioille flash-muistilla, ja sitÀ verrataan kokeellisesti LA-puuhun (Lazy Adaptive Tree, suom. laiska adaptiivinen puu), joka on suoriutunut hyvin kokeissa flash-muistilla. Kokeissa kÀytettiin vaihtelevan kokoisia alkeiskimppuja, eli maksimaalisia joukkoja lisÀtyssÀ datassa jotka sijoittuvat kahden olemassaolevan avaimen vÀliin. Kimppuhakemisto oli nopeampi kuin LA-puu, ja erityisen paljon nopeampi kimppulisÀyksissÀ pienellÀ mÀÀrÀllÀ hyvin suuria tai suurella mÀÀrÀllÀ hyvin pieniÀ alkeiskimppuja, tai suurilla kimppulisÀyksillÀ. Parhaimmillaan se oli yli neljÀ kertaa nopeampi. VÀlihauissa se oli jopa 50 % nopeampi kuin LA-puu, ja parempi suurten vÀlien kanssa. VÀlipoistot nÀytettiin vakioaikaisiksi kimppuhakemistossa

    Design and Implementation of a Middleware for Uniform, Federated and Dynamic Event Processing

    Get PDF
    In recent years, real-time processing of massive event streams has become an important topic in the area of data analytics. It will become even more important in the future due to cheap sensors, a growing amount of devices and their ubiquitous inter-connection also known as the Internet of Things (IoT). Academia, industry and the open source community have developed several event processing (EP) systems that allow users to define, manage and execute continuous queries over event streams. They achieve a significantly better performance than the traditional store-then-process'' approach in which events are first stored and indexed in a database. Because EP systems have different roots and because of the lack of standardization, the system landscape became highly heterogenous. Today's EP systems differ in APIs, execution behaviors and query languages. This thesis presents the design and implementation of a novel middleware that abstracts from different EP systems and provides a uniform API, execution behavior and query language to users and developers. As a consequence, the presented middleware overcomes the problem of vendor lock-in and different EP systems are enabled to cooperate with each other. In practice, event streams differ dramatically in volume and velocity. We show therefore how the middleware can connect to not only different EP systems, but also database systems and a native implementation. Emerging applications such as the IoT raise novel challenges and require EP to be more dynamic. We present extensions to the middleware that enable self-adaptivity which is needed in context-sensitive applications and those that deal with constantly varying sets of event producers and consumers. Lastly, we extend the middleware to fully support the processing of events containing spatial data and to be able to run distributed in the form of a federation of heterogenous EP systems

    WRITE-INTENSIVE DATA MANAGEMENT IN LOG-STRUCTURED STORAGE

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH
    corecore