Index Strukturen für Data Warehouse (Index Structures for Data Warehouses)
0. Title and Table of Contents
1. Introduction
2. State of the Art of Data Warehouse Research
3. Data Storage and Index Structures
4. Mixed Integer Problems for Finding Optimal Tree-based Index Structures
5. Aggregated Data in Tree-Based Index Structures
6. Performance Models for Tree-Based Index Structures
7. Techniques for Comparing Index Structures
8. Conclusion and Outlook
Bibliography
Appendix

This thesis investigates which index structures support query processing in
typical data warehouse environments most efficiently. Data warehouse
applications differ significantly from traditional transaction-oriented
operational applications. Therefore, the techniques applied in transaction-
oriented systems cannot be used in the context of data warehouses and new
techniques must be developed. The thesis shows that the time complexity of
computing optimal tree-based index structures prohibits their use in
real-world applications. Therefore, we improve heuristic techniques (e.g.,
the R*-tree) to process range queries on aggregated data more efficiently.
Experiments show the benefits of this approach for different kinds of typical
data warehouse queries. Performance models estimate the behavior of standard
index structures and the behavior of the extended index structures. We
introduce a new model that considers the distribution of data. We show
experimentally that the new model is more precise than other models known from
the literature. We present two techniques that compare two tree-based index
structures with two bitmap indexing techniques. The performance of these index
structures depends on a set of parameters; our results show which index
structure performs most efficiently depending on these parameters.

This thesis investigates which index structures efficiently support queries in
typical data warehouse systems. Index structures, a subject of database
research for more than twenty years, have traditionally been optimized for
transaction-oriented systems. A characteristic of these systems is the
efficient support of insert, update, and delete operations on individual
records. Typical operations in data warehouse systems, in contrast, are
complex queries on large, relatively static data sets. Because of these
changed requirements, database management systems used for data warehouses
must employ different techniques to support complex queries efficiently.
First, an approach is examined that computes an optimal index structure by
means of a mixed integer optimization problem. Since the cost of computing
this optimal index structure grows exponentially with the number of records to
be indexed, the subsequent parts of the thesis pursue heuristic approaches
that scale with the size of the data set. One approach extends tree-based
index structures with aggregated data in the inner nodes. Experiments show
that the materialized intermediate results in the inner nodes allow range
queries on aggregated data to be processed considerably faster. To study the
performance of index structures with and without materialized intermediate
results, the PISA model (Performance of Index Structures with and without
Aggregated Data) is developed. The model takes both the distribution of the
data and the distribution of the queries into account, and it is adapted to
uniformly, skewed, and normally distributed data sets. Experiments show that
the PISA model is more precise than the models previously known from the
literature. The performance of index structures depends on various parameters;
this thesis presents two techniques that compare index structures depending on
a given set of parameters. Using classification trees, it is shown, for
example, that the block size influences relative performance less than other
parameters. A further result is that bitmap index structures benefit more from
the improvements of newer secondary storage than the tree-based index
structures common today. Bitmap indexing techniques thus still offer great
potential for further performance gains in the database field.
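The aggregated-data technique described above can be sketched in miniature. The following is a hypothetical one-dimensional illustration, not the thesis's actual R*-tree extension: each inner node materializes the sum of its subtree, so a range-sum query answers any fully covered subtree from the stored aggregate instead of descending to the leaves.

```python
# Hypothetical sketch: a binary tree over sorted (key, measure) pairs in
# which every inner node stores the SUM of its subtree as a materialized
# intermediate result, the idea behind aggregated data in inner nodes.

class Node:
    def __init__(self, pairs):
        self.lo = pairs[0][0]            # smallest key in this subtree
        self.hi = pairs[-1][0]           # largest key in this subtree
        if len(pairs) == 1:
            self.left = self.right = None
            self.agg = pairs[0][1]       # leaf: the measure value itself
        else:
            mid = len(pairs) // 2
            self.left = Node(pairs[:mid])
            self.right = Node(pairs[mid:])
            self.agg = self.left.agg + self.right.agg  # materialized sum

def range_sum(n, lo, hi):
    """Sum of measures for all keys in [lo, hi]."""
    if hi < n.lo or n.hi < lo:
        return 0                         # no overlap with this subtree
    if lo <= n.lo and n.hi <= hi:
        return n.agg                     # fully covered: use the aggregate
    return range_sum(n.left, lo, hi) + range_sum(n.right, lo, hi)
```

Fully covered subtrees contribute a single stored value, which is why range queries on aggregated data become much cheaper than summing at the leaves.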
Fast and Lean Immutable Multi-Maps on the JVM based on Heterogeneous Hash-Array Mapped Tries
An immutable multi-map is a many-to-many thread-friendly map data structure
with expected fast insert and lookup operations. This data structure is used
for applications processing graphs or many-to-many relations as applied in
static analysis of object-oriented systems. When processing such big data sets
the memory overhead of the data structure encoding itself is a memory usage
bottleneck. Motivated by reuse and type-safety, libraries for Java, Scala and
Clojure typically implement immutable multi-maps by nesting sets as the values
with the keys of a trie map. With this design, our measurements show that the
expected byte overhead per stored entry of a sparse multi-map adds up to
around 65B, which makes it infeasible to compute with effectively on the JVM.
In this paper we propose a general framework for Hash-Array Mapped Tries on
the JVM which can store type-heterogeneous keys and values: a Heterogeneous
Hash-Array Mapped Trie (HHAMT). Among other applications, this allows for a
highly efficient multi-map encoding by (a) not reserving space for empty value
sets and (b) inlining the values of singleton sets while maintaining a (c)
type-safe API.
We detail the necessary encoding and optimizations to mitigate the overhead
of storing and retrieving heterogeneous data in a hash-trie. Furthermore, we
evaluate HHAMT specifically for the application to multi-maps, comparing them
to state-of-the-art encodings of multi-maps in Java, Scala and Clojure. We
isolate key differences using microbenchmarks and validate the resulting
conclusions on a real world case in static analysis. The new encoding brings
the per key-value storage overhead down to 30B: a 2x improvement. With
additional inlining of primitive values it reaches a 4x improvement.
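The savings the paper attributes to (a) not reserving space for empty value sets and (b) inlining singleton values can be imitated in any language. The sketch below is hypothetical and not the paper's HHAMT implementation: a single-element entry is stored as a bare value and only promoted to a set on the second insert.

```python
# Hypothetical sketch of the multi-map optimization: keys with exactly one
# value store the value inline; a real set is allocated only when a second
# distinct value arrives. (Toy assumption: values are not themselves sets.)

class MultiMap:
    def __init__(self):
        self._data = {}                      # key -> inlined value or set

    def insert(self, key, value):
        cur = self._data.get(key)
        if cur is None:
            self._data[key] = value          # singleton: no set allocated
        elif isinstance(cur, set):
            cur.add(value)
        elif cur != value:
            self._data[key] = {cur, value}   # promote singleton to a set

    def get(self, key):
        cur = self._data.get(key)
        if cur is None:
            return set()
        return cur if isinstance(cur, set) else {cur}
```

In the HHAMT setting the same trick is done inside a type-heterogeneous trie node rather than a hash table, which is what keeps the API type-safe while avoiding per-key set objects.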
Compressing High-Dimensional Data Spaces Using Non-Differential Augmented Vector Quantization
query processing times and space requirements. Database compression has been
discovered to alleviate the I/O bottleneck, reduce disk space, improve disk access speed,
speed up query, reduce overall retrieval time and increase the effective I/O bandwidth.
However, random access to individual tuples in a compressed database is very difficult to
achieve with most available compression techniques.
We propose a lossless compression technique called non-differential augmented vector
quantization, a close variant of the novel augmented vector quantization. The technique is
applicable to a collection of tuples and especially effective for tuples with many low to
medium cardinality fields. In addition, the technique supports standard database
operations, permits very fast random access and atomic decompression of tuples in large
collections. The technique maps a database relation into a static bitmap index cached
access structure. Consequently, we were able to achieve substantial savings in space by
storing each database tuple as a bit value in the computer memory.
Important distinguishing characteristics of our technique are that (a)
individual tuples can be compressed and decompressed, rather than a full page
or an entire relation at a time, and (b) the information needed for tuple
compression and decompression can reside in memory or, at worst, in a single
page. Promising application domains include decision support systems,
statistical databases, and live databases with low-cardinality fields and
possibly no text fields.
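The properties claimed above, lossless coding that is effective for low-cardinality fields and allows atomic per-tuple decompression, can be illustrated with plain per-field dictionary coding. This is a hypothetical stand-in, not the paper's augmented vector quantization algorithm.

```python
# Hypothetical illustration: one small value dictionary per field maps each
# tuple to a short code vector. Compression is lossless, works best when
# fields have few distinct values, and any single tuple can be decompressed
# without touching its neighbors.

def build_dicts(tuples):
    """Build one value->code dictionary per field."""
    n_fields = len(tuples[0])
    dicts = []
    for i in range(n_fields):
        values = sorted({t[i] for t in tuples})
        dicts.append({v: code for code, v in enumerate(values)})
    return dicts

def compress(t, dicts):
    return tuple(d[v] for v, d in zip(t, dicts))

def decompress(codes, dicts):
    rev = [{code: v for v, code in d.items()} for d in dicts]
    return tuple(r[c] for c, r in zip(codes, rev))
```

Because the dictionaries are small for low-cardinality fields, they can plausibly stay in memory, mirroring the paper's claim that the decompression metadata fits in memory or a single page.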
Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By
exploiting bit-level parallelism, they can significantly accelerate queries.
However, they can use much memory, and thus we might prefer compressed bitmap
indexes. Following Oracle's lead, bitmaps are often compressed using run-length
encoding (RLE). Building on prior work, we introduce the Roaring compressed
bitmap format: it uses packed arrays for compression instead of RLE. We compare
it to two high-performance RLE-based bitmap encoding techniques: WAH (Word
Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable
Integer Set). On synthetic and real data, we find that Roaring bitmaps (1)
often compress significantly better (e.g., 2 times) and (2) are faster than the
compressed alternatives (up to 900 times faster for intersections). Our results
challenge the view that RLE-based bitmap compression is best.
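The core of the packed-array idea can be sketched briefly. This is a simplified, hypothetical model of the format (the real Roaring library also has run containers and optimized set operations): 32-bit values are split into a 16-bit high part selecting a container and a 16-bit low part stored in it, and a container switches from a sorted array to a bitmap once it holds more than 4096 entries.

```python
import bisect

ARRAY_LIMIT = 4096   # beyond this, an 8 KiB bitmap beats a 16-bit array

def add(containers, x):
    """containers: dict of 16-bit high part -> sorted list or bytearray."""
    hi, lo = x >> 16, x & 0xFFFF
    c = containers.get(hi)
    if c is None:
        containers[hi] = [lo]                 # start as a packed array
    elif isinstance(c, bytearray):
        c[lo >> 3] |= 1 << (lo & 7)           # bitmap container: set a bit
    elif lo not in c:
        bisect.insort(c, lo)
        if len(c) > ARRAY_LIMIT:              # convert array -> bitmap
            bm = bytearray(8192)              # 2^16 bits
            for v in c:
                bm[v >> 3] |= 1 << (v & 7)
            containers[hi] = bm

def contains(containers, x):
    hi, lo = x >> 16, x & 0xFFFF
    c = containers.get(hi)
    if c is None:
        return False
    if isinstance(c, bytearray):
        return bool(c[lo >> 3] & (1 << (lo & 7)))
    i = bisect.bisect_left(c, lo)
    return i < len(c) and c[i] == lo
```

The per-container choice is what distinguishes this family from RLE schemes such as WAH and Concise: sparse chunks stay compact as arrays while dense chunks get constant-time membership via the bitmap.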
Load-balanced Range Query Workload Partitioning for Compressed Spatial Hierarchical Bitmap (cSHB) Indexes
Spatial databases store geometric objects such as points, lines, and polygons. Querying such complex spatial objects is a challenging task. Index structures improve the lookup performance over the stored objects, but traditional index structures do not perform well for spatial databases. A significant amount of research has addressed ingesting, indexing, and querying spatial objects for different types of spatial queries, such as range, nearest-neighbor, and join queries. The Compressed Spatial Hierarchical Bitmap (cSHB) index is one such indexing and querying approach that supports spatial range query workloads (sets of queries). cSHB indexes, like many other approaches, lack parallel computation. The massive amount of spatial data requires substantial computation, and traditional methods are insufficient to address this. Other existing parallel processing approaches lack load balancing of parallel tasks, which leads to resource-overloading bottlenecks.
In this thesis, I propose novel spatial partitioning techniques, Max Containment Clustering and Max Containment Clustering with Separation, to create load-balanced partitions of a range query workload. Each partition takes a similar amount of time to process its spatial queries, reducing response latency by minimizing disk access cost and optimizing the bitmap operations. The partitions are processed in parallel using cSHB indexes. The proposed techniques exploit the block-based organization of bitmaps in the cSHB index and improve its performance for processing a range query workload.
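The load-balancing goal, partitions that take similar time to process, can be illustrated with a classic greedy scheme. This is a hypothetical sketch and NOT the thesis's Max Containment Clustering: queries are sorted by estimated cost and each is assigned to the currently lightest partition (longest-processing-time scheduling).

```python
import heapq

# Hypothetical illustration of load-balanced workload partitioning: assign
# each query (largest estimated cost first) to the partition with the
# smallest total cost so far, keeping the k partition loads close together.

def partition_workload(query_costs, k):
    """query_costs: list of (query_id, estimated_cost). Returns k lists."""
    heap = [(0.0, i) for i in range(k)]       # (total cost, partition index)
    heapq.heapify(heap)
    partitions = [[] for _ in range(k)]
    for qid, cost in sorted(query_costs, key=lambda qc: -qc[1]):
        total, i = heapq.heappop(heap)        # lightest partition so far
        partitions[i].append(qid)
        heapq.heappush(heap, (total + cost, i))
    return partitions
```

The thesis's techniques additionally exploit which bitmap blocks the queries share, so queries touching the same blocks land in the same partition; the sketch above balances cost only.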