12 research outputs found
Attribute Value Reordering For Efficient Hybrid OLAP
The normalization of a data cube is the ordering of the attribute values. For
large multidimensional arrays where dense and sparse chunks are stored
differently, proper normalization can lead to improved storage efficiency. We
show that it is NP-hard to compute an optimal normalization even for 1x3
chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are
nearly statistically independent, we show that dimension-wise attribute
frequency sorting is an optimal normalization and takes time O(d n log(n)) for
data cubes of size n^d. When dimensions are not independent, we propose and
evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is
already 19%-30% more efficient than ROLAP, but normalization can improve it
further by 9%-13% for a total gain of 29%-44% over ROLAP
Attribute Value Reordering for Efficient Hybrid OLAP
The normalization of a data cube is the process of choosing an ordering for the attribute values, and the chosen ordering will affect the physical storage of the cube's data. For large multidimensional arrays, proper normalization can lead to more efficient storage in hybrid OLAP contexts that store dense and sparse chunks differently. We show that it is NP-hard to compute an optimal normalization even for 1x3 chunks, although we find an exact algorithm for 1x2 chunks. When attributes are nearly statistically independent, we show that an optimal normalization is given by dimension-wise attribute frequency sorting, which can be done in time O(d n log(n)) for data cubes of size n^d. When attributes are not independent, we propose and evaluate a number of heuristics.\ud
\ud
Our optimized hybrid OLAP storage mechanism was observed to be 44% more storage efficient than ROLAP and the gains due to normalization alone accounted for 45% of this increase in efficiency
Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By
exploiting bit-level parallelism, they can significantly accelerate queries.
However, they can use much memory, and thus we might prefer compressed bitmap
indexes. Following Oracle's lead, bitmaps are often compressed using run-length
encoding (RLE). Building on prior work, we introduce the Roaring compressed
bitmap format: it uses packed arrays for compression instead of RLE. We compare
it to two high-performance RLE-based bitmap encoding techniques: WAH (Word
Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable
Integer Set). On synthetic and real data, we find that Roaring bitmaps (1)
often compress significantly better (e.g., 2 times) and (2) are faster than the
compressed alternatives (up to 900 times faster for intersections). Our results
challenge the view that RLE-based bitmap compression is best
SPARSITY HANDLING AND DATA EXPLOSION IN OLAP SYSTEMS
A common problem with OnLine Analytical Processing (OLAP) databases is data explosion - data size multiplies, when it is loaded from the source data into multidimensional cubes. Data explosion is not an issue for small databases, but can be serious problems with large databases. In this paper we discuss the sparsity and data explosion phenomenon in multidimensional data model, which lie at the core of OLAP systems. Our researches over five companies with different branch of business confirm the observations that in reality most of the cubes are extremely sparse. We also consider a different method that relational and multidimensional severs applies to reduce the data explosion and sparsity problems as compression and indexes techniques, partitioning, preliminary aggregations
Reordering Columns for Smaller Indexes
Column-oriented indexes-such as projection or bitmap indexes-are compressed
by run-length encoding to reduce storage and increase speed. Sorting the tables
improves compression. On realistic data sets, permuting the columns in the
right order before sorting can reduce the number of runs by a factor of two or
more. Unfortunately, determining the best column order is NP-hard. For many
cases, we prove that the number of runs in table columns is minimized if we
sort columns by increasing cardinality. Experimentally, sorting based on
Hilbert space-filling curves is poor at minimizing the number of runs.Comment: to appear in Information Science