Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By
exploiting bit-level parallelism, they can significantly accelerate queries.
However, they can use much memory, and thus we might prefer compressed bitmap
indexes. Following Oracle's lead, bitmaps are often compressed using run-length
encoding (RLE). Building on prior work, we introduce the Roaring compressed
bitmap format: it uses packed arrays for compression instead of RLE. We compare
it to two high-performance RLE-based bitmap encoding techniques: WAH (Word
Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable
Integer Set). On synthetic and real data, we find that Roaring bitmaps (1)
often compress significantly better (e.g., 2 times) and (2) are faster than the
compressed alternatives (up to 900 times faster for intersections). Our results
challenge the view that RLE-based bitmap compression is best.
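To make the "packed arrays instead of RLE" idea concrete, here is a minimal Python sketch of a Roaring-style layout. It is an illustration under stated assumptions, not the authors' implementation: the names `build` and `contains` are hypothetical, the 4096-element threshold and the 16-bit chunking follow the commonly described Roaring layout rather than anything stated in the abstract, and a Python integer stands in for a fixed-size bitmap.

```python
import bisect

# Assumed threshold: chunks with at most this many values are kept as
# sorted arrays of 16-bit integers; denser chunks become bitmaps.
ARRAY_LIMIT = 4096


def build(values):
    """Group 32-bit integers by their high 16 bits into containers."""
    chunks = {}
    for v in sorted(set(values)):
        chunks.setdefault(v >> 16, []).append(v & 0xFFFF)

    containers = {}
    for key, lows in chunks.items():
        if len(lows) <= ARRAY_LIMIT:
            # Sparse chunk: packed sorted array of the low 16 bits.
            containers[key] = ("array", lows)
        else:
            # Dense chunk: one bit per possible low value
            # (a Python int is used here in place of a 65536-bit bitmap).
            bits = 0
            for low in lows:
                bits |= 1 << low
            containers[key] = ("bitmap", bits)
    return containers


def contains(containers, v):
    """Membership test: find the chunk, then probe the array or bitmap."""
    container = containers.get(v >> 16)
    if container is None:
        return False
    kind, payload = container
    low = v & 0xFFFF
    if kind == "array":
        i = bisect.bisect_left(payload, low)
        return i < len(payload) and payload[i] == low
    return (payload >> low) & 1 == 1


# Two sparse chunks (keys 0 and 1) and one dense chunk (key 2).
sample = build([1, 2, 70000] + list(range(131072, 131072 + 5000)))
```

In this sketch a sparse chunk costs about 16 bits per stored value, while a dense chunk costs a fixed 65536 bits regardless of how many values it holds, which is the trade-off the array/bitmap switch is meant to exploit.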
Reordering Rows for Better Compression: Beyond the Lexicographic Order
Sorting database tables before compressing them improves the compression
rate. Can we do better than the lexicographical order? For minimizing the
number of runs in a run-length encoding compression scheme, the best approaches
to row-ordering are derived from traveling salesman heuristics, although there
is a significant trade-off between running time and compression. A new
heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades
off compression for a major running-time speedup, is a good option for very
large tables. However, for some compression schemes, it is more important to
generate long runs rather than few runs. For this case, another novel
heuristic, Vortex, is promising. We find that we can improve run-length
encoding by up to a factor of 3, whereas we can improve prefix coding by up to 80%:
these gains are on top of the gains due to lexicographically sorting the table.
We prove that the new row reordering is optimal (within 10%) at minimizing the
runs of identical values within columns, in a few cases.
Comment: to appear in ACM TOD
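The effect of row order on run counts can be sketched in a few lines of Python. This is a toy illustration only: `count_runs`, `hamming`, and `nearest_neighbor_order` are hypothetical names, the greedy routine is a plain Nearest Neighbor heuristic over Hamming distance (not the Multiple Lists or Vortex heuristics named in the abstract), and the four-row table is made up for the example.

```python
def count_runs(rows):
    """Total number of runs of identical values, summed over all columns."""
    if not rows:
        return 0
    runs = len(rows[0])  # every column starts with one run
    for prev, cur in zip(rows, rows[1:]):
        # Each value that differs from the row above starts a new run.
        runs += sum(1 for a, b in zip(prev, cur) if a != b)
    return runs


def hamming(r, s):
    """Number of columns in which two rows differ."""
    return sum(1 for a, b in zip(r, s) if a != b)


def nearest_neighbor_order(rows):
    """Greedy ordering: always append the unused row closest to the last one."""
    remaining = list(rows)
    if not remaining:
        return []
    order = [remaining.pop(0)]
    while remaining:
        last = order[-1]
        idx = min(range(len(remaining)),
                  key=lambda i: hamming(last, remaining[i]))
        order.append(remaining.pop(idx))
    return order


table = [(1, "a"), (2, "b"), (1, "a"), (2, "b")]
unsorted_runs = count_runs(table)
lexicographic_runs = count_runs(sorted(table))
greedy_runs = count_runs(nearest_neighbor_order(table))
```

The greedy routine compares the last row against every remaining row, so it is quadratic in the number of rows; that cost is the kind of running-time versus compression trade-off the abstract describes for traveling-salesman-style heuristics.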
ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for
scientific file formats, that aims to make declarative array manipulations
interoperable with imperative file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation at NERSC, a
large-scale scientific computing facility, shows that ArrayBridge exhibits
performance and I/O scalability that are statistically indistinguishable from
those of the native SciDB storage engine.
Comment: 12 pages, 13 figures