
    Index Strukturen für Data Warehouse

    Table of contents: Title and Table of Contents; 1. Introduction; 2. State of the Art of Data Warehouse Research; 3. Data Storage and Index Structures; 4. Mixed Integer Problems for Finding Optimal Tree-based Index Structures; 5. Aggregated Data in Tree-Based Index Structures; 6. Performance Models for Tree-Based Index Structures; 7. Techniques for Comparing Index Structures; 8. Conclusion and Outlook; Bibliography; Appendix.

    This thesis investigates which index structures support query processing in typical data warehouse environments most efficiently. Data warehouse applications differ significantly from traditional transaction-oriented operational applications, so the techniques applied in transaction-oriented systems cannot simply be reused for data warehouses, and new techniques must be developed. The thesis shows that the time complexity of computing optimal tree-based index structures prohibits their use in real-world applications. We therefore improve heuristic techniques (e.g., the R*-tree) to process range queries on aggregated data more efficiently. Experiments show the benefits of this approach for different kinds of typical data warehouse queries. Performance models estimate the behavior of both the standard and the extended index structures; we introduce a new model that takes the distribution of the data into account and show experimentally that it is more precise than other models known from the literature. Finally, two techniques compare two tree-based index structures with two bitmap indexing techniques. The performance of these index structures depends on a set of parameters, and our results show which index structure performs most efficiently depending on those parameters.

    This thesis investigates which index structures efficiently support queries in typical data warehouse systems. Index structures, a subject of database research for more than twenty years, have traditionally been optimized for transaction-oriented systems. A characteristic of such systems is the efficient support of insert, update, and delete operations on individual records. Typical operations in data warehouse systems, by contrast, are complex queries over large, relatively static data sets. Because of these changed requirements, database management systems used for data warehouses must apply different techniques to support complex queries efficiently. First, an approach is examined that computes an optimal index structure by solving a mixed integer optimization problem. Since the cost of computing this optimal index structure grows exponentially with the number of records to be indexed, the subsequent parts of the thesis pursue heuristic approaches that scale with the size of the data to be indexed. One approach extends tree-based index structures with aggregated data in the inner nodes. Experiments show that these materialized intermediate results in the inner nodes allow range queries on aggregated data to be answered considerably faster. To study the performance of index structures with and without materialized intermediate results, the PISA model (Performance of Index Structures with and without Aggregated Data) is developed. The model takes the distribution of the data and the distribution of the queries into account, and it is adapted to uniformly, skewed, and normally distributed data sets. Experiments show that the PISA model is more precise than models previously known from the literature. The performance of index structures depends on several parameters, and the thesis presents two techniques that compare index structures as a function of a given set of parameters. Using classification trees it is shown, for example, that the block size influences relative performance less than other parameters do. A further result is that bitmap index structures benefit more from the improvements of newer secondary storage than today's common tree-based index structures; bitmap indexing techniques still offer great potential for further performance gains in the database field.
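    The central idea of storing aggregated data in inner nodes can be illustrated with a small, hedged sketch. The class below is illustrative only and not taken from the thesis; the one-dimensional keys and the SUM aggregate are simplifying assumptions. Each node keeps a materialized subtree sum, so a range-SUM query can use that cached value whenever a subtree's key range is fully covered by the query, instead of descending to the leaves.

```java
// Illustrative sketch (not the thesis code): a 1-D tree index whose inner
// nodes cache the SUM of all values in their subtree. A range-SUM query
// uses the cached aggregate whenever a node's key range is fully covered
// by the query range.
final class AggNode {
    final long lo, hi;          // key range covered by this subtree
    final double subtreeSum;    // materialized aggregate for the whole subtree
    final AggNode[] children;   // empty array for leaf nodes
    final long[] leafKeys;      // entries stored in a leaf (null for inner nodes)
    final double[] leafValues;

    AggNode(long lo, long hi, double subtreeSum, AggNode[] children,
            long[] leafKeys, double[] leafValues) {
        this.lo = lo; this.hi = hi; this.subtreeSum = subtreeSum;
        this.children = children; this.leafKeys = leafKeys; this.leafValues = leafValues;
    }

    /** SUM of all values whose key lies in [qLo, qHi]. */
    double rangeSum(long qLo, long qHi) {
        if (qHi < lo || hi < qLo) return 0.0;           // disjoint: contributes nothing
        if (qLo <= lo && hi <= qHi) return subtreeSum;  // fully covered: use cached aggregate
        if (children.length == 0) {                     // partially covered leaf: scan entries
            double s = 0.0;
            for (int i = 0; i < leafKeys.length; i++)
                if (qLo <= leafKeys[i] && leafKeys[i] <= qHi) s += leafValues[i];
            return s;
        }
        double s = 0.0;                                 // partially covered inner node: recurse
        for (AggNode c : children) s += c.rangeSum(qLo, qHi);
        return s;
    }
}
```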

    Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries

    An immutable multi-map is a many-to-many thread-friendly map data structure with expected fast insert and lookup operations. This data structure is used for applications processing graphs or many-to-many relations as applied in static analysis of object-oriented systems. When processing such big data sets the memory overhead of the data structure encoding itself is a memory usage bottleneck. Motivated by reuse and type-safety, libraries for Java, Scala and Clojure typically implement immutable multi-maps by nesting sets as the values with the keys of a trie map. Like this, based on our measurements the expected byte overhead for a sparse multi-map per stored entry adds up to around 65B, which renders it unfeasible to compute with effectively on the JVM. In this paper we propose a general framework for Hash-Array Mapped Tries on the JVM which can store type-heterogeneous keys and values: a Heterogeneous Hash-Array Mapped Trie (HHAMT). Among other applications, this allows for a highly efficient multi-map encoding by (a) not reserving space for empty value sets and (b) inlining the values of singleton sets while maintaining a (c) type-safe API. We detail the necessary encoding and optimizations to mitigate the overhead of storing and retrieving heterogeneous data in a hash-trie. Furthermore, we evaluate HHAMT specifically for the application to multi-maps, comparing them to state-of-the-art encodings of multi-maps in Java, Scala and Clojure. We isolate key differences using microbenchmarks and validate the resulting conclusions on a real world case in static analysis. The new encoding brings the per key-value storage overhead down to 30B: a 2x improvement. With additional inlining of primitive values it reaches a 4x improvement
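    The multi-map encoding trick of (a) not reserving space for empty value sets and (b) inlining singleton values can be sketched independently of the HHAMT machinery. The class below is a hedged illustration and not the paper's implementation: it uses a plain HashMap instead of a hash-array mapped trie, and it discriminates slot kinds with a runtime instanceof check rather than HHAMT's type-safe heterogeneous node encoding.

```java
import java.util.*;

// Illustrative sketch (not the paper's HHAMT): singleton inlining for a
// multi-map. Keys with exactly one value store that value directly; only
// keys with several values pay for a Set; absent keys occupy no slot.
final class InliningMultiMap<K, V> {
    private final Map<K, Object> slots = new HashMap<>(); // each slot holds a V or a Set<V>

    @SuppressWarnings("unchecked")
    void put(K key, V value) {
        Object slot = slots.get(key);
        if (slot == null) {
            slots.put(key, value);              // first value for this key: inline it
        } else if (slot instanceof Set) {
            ((Set<V>) slot).add(value);         // already a set: just add
        } else if (!slot.equals(value)) {
            Set<V> set = new HashSet<>();       // second distinct value: upgrade to a set
            set.add((V) slot);
            set.add(value);
            slots.put(key, set);
        }
    }

    @SuppressWarnings("unchecked")
    Set<V> get(K key) {
        Object slot = slots.get(key);
        if (slot == null) return Collections.emptySet();
        if (slot instanceof Set) return Collections.unmodifiableSet((Set<V>) slot);
        return Collections.singleton((V) slot); // inlined singleton
    }
    // Note: this toy relies on instanceof, so V must not itself be a Set;
    // the HHAMT of the paper avoids this by tagging slot kinds in its nodes.
}
```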

    Compressing High-Dimensional Data Spaces Using Non-Differential Augmented Vector Quantization

    … query processing times and space requirements. Database compression has been shown to alleviate the I/O bottleneck, reduce disk space, improve disk access speed, speed up queries, reduce overall retrieval time and increase the effective I/O bandwidth. However, random access to individual tuples in a compressed database is very difficult to achieve with most available compression techniques. We propose a lossless compression technique called non-differential augmented vector quantization, a close variant of the novel augmented vector quantization. The technique is applicable to a collection of tuples and is especially effective for tuples with many low- to medium-cardinality fields. In addition, the technique supports standard database operations and permits very fast random access and atomic decompression of tuples in large collections. The technique maps a database relation into a static bitmap index cached access structure. Consequently, we were able to achieve substantial savings in space by storing each database tuple as a bit value in computer memory. Important distinguishing characteristics of our technique are that (a) individual tuples can be compressed and decompressed, rather than a full page or an entire relation at a time, and (b) the information needed for tuple compression and decompression can reside in memory or, at worst, in a single page. Promising application domains include decision support systems, statistical databases and life databases with low-cardinality fields and possibly no text fields.
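    For intuition, the sketch below shows the general family of techniques this abstract belongs to: codebook-based compression of low-cardinality fields with fixed-width codes per tuple, which is what makes atomic, random-access decompression of a single tuple possible. The class and its methods are illustrative assumptions; this is not the paper's non-differential augmented vector quantization algorithm or its bitmap access structure.

```java
import java.util.*;

// Generic sketch (not the paper's algorithm): per-column dictionary
// ("codebook") encoding for low-cardinality fields. Each tuple is stored as
// a fixed-width array of small codes, so any single tuple can be fetched and
// decompressed without touching a page or the whole relation.
final class CodebookRelation {
    private final List<Map<String, Integer>> encode = new ArrayList<>(); // value -> code, per column
    private final List<List<String>> decode = new ArrayList<>();         // code -> value, per column
    private final List<int[]> rows = new ArrayList<>();                  // compressed tuples

    CodebookRelation(int columns) {
        for (int i = 0; i < columns; i++) {
            encode.add(new HashMap<>());
            decode.add(new ArrayList<>());
        }
    }

    void insert(String... tuple) {
        int[] codes = new int[tuple.length];
        for (int c = 0; c < tuple.length; c++) {
            final int col = c;
            codes[c] = encode.get(c).computeIfAbsent(tuple[c], v -> {
                decode.get(col).add(v);                // extend the codebook on first occurrence
                return decode.get(col).size() - 1;
            });
        }
        rows.add(codes);
    }

    /** Atomic decompression of one tuple, without scanning its neighbours. */
    String[] fetch(int rowId) {
        int[] codes = rows.get(rowId);
        String[] tuple = new String[codes.length];
        for (int c = 0; c < codes.length; c++) tuple[c] = decode.get(c).get(codes[c]);
        return tuple;
    }
}
```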

    Better bitmap performance with Roaring bitmaps

    Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best.
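    As a usage sketch, the snippet below exercises the Roaring format through the org.roaringbitmap Java library (assumed to be on the classpath); the row ids and predicates are made up for illustration. It shows the standard bitmap-index pattern of answering a conjunctive predicate by intersecting per-value bitmaps of matching row ids.

```java
import org.roaringbitmap.RoaringBitmap;

// Small usage sketch of Roaring bitmaps as a bitmap index: one bitmap of
// matching row ids per attribute value, intersected to answer a predicate.
public class RoaringExample {
    public static void main(String[] args) {
        RoaringBitmap countryDE = RoaringBitmap.bitmapOf(1, 5, 7, 1_000_000); // rows where country = 'DE'
        RoaringBitmap year2023 = RoaringBitmap.bitmapOf(5, 6, 7, 999_999);    // rows where year = 2023

        // Intersection answers the conjunctive predicate without scanning rows.
        RoaringBitmap both = RoaringBitmap.and(countryDE, year2023);
        System.out.println(both.getCardinality()); // 2 (rows 5 and 7)
        System.out.println(both.contains(7));      // true

        countryDE.runOptimize(); // let the library switch to run containers where that saves space
    }
}
```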

    Load-balanced Range Query Workload Partitioning for Compressed Spatial Hierarchical Bitmap (cSHB) Indexes

    Spatial databases are used to store geometric objects such as points, lines, and polygons. Querying such complex spatial objects is a challenging task. Index structures are used to improve lookup performance on the stored objects, but traditional index structures do not perform well for spatial databases. A significant amount of research addresses ingesting, indexing and querying spatial objects for different types of spatial queries, such as range, nearest-neighbor, and join queries. The Compressed Spatial Hierarchical Bitmap (cSHB) index is one such indexing and querying approach that supports spatial range query workloads (sets of queries). cSHB indexes, like many other approaches, lack parallel computation. The massive amount of spatial data requires a lot of computation, and traditional methods are insufficient to address these issues. Other existing parallel processing approaches lack load balancing of parallel tasks, which leads to resource-overloading bottlenecks. In this thesis, I propose novel spatial partitioning techniques, Max Containment Clustering and Max Containment Clustering with Separation, to create load-balanced partitions of a range query workload. Each partition takes a similar amount of time to process its spatial queries, reducing response latency by minimizing disk access cost and optimizing the bitmap operations. The partitions are processed in parallel using cSHB indexes. The proposed techniques exploit the block-based organization of bitmaps in the cSHB index and improve its performance for processing a range query workload.
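    As a point of reference for what load-balanced workload partitioning means, the sketch below applies a generic greedy longest-processing-time heuristic: queries are sorted by estimated cost and each one is assigned to the currently lightest of k partitions. This is a baseline illustration with made-up cost estimates, not the thesis's Max Containment Clustering or Max Containment Clustering with Separation techniques.

```java
import java.util.*;

// Generic baseline sketch (not the thesis's algorithms): greedy
// longest-processing-time partitioning of a range-query workload into k
// partitions, so that all partitions finish at roughly the same time.
public class WorkloadPartitioner {
    record Query(String id, double estimatedCost) {}

    static List<List<Query>> partition(List<Query> workload, int k) {
        List<Query> sorted = new ArrayList<>(workload);
        sorted.sort(Comparator.comparingDouble(Query::estimatedCost).reversed());

        List<List<Query>> partitions = new ArrayList<>();
        double[] load = new double[k];
        for (int i = 0; i < k; i++) partitions.add(new ArrayList<>());

        for (Query q : sorted) {
            int lightest = 0;                              // pick the partition with the least load
            for (int i = 1; i < k; i++) if (load[i] < load[lightest]) lightest = i;
            partitions.get(lightest).add(q);
            load[lightest] += q.estimatedCost();
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Query> workload = List.of(
                new Query("q1", 9.0), new Query("q2", 4.0), new Query("q3", 4.0),
                new Query("q4", 3.0), new Query("q5", 2.0));
        partition(workload, 2).forEach(System.out::println);
    }
}
```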