Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By
exploiting bit-level parallelism, they can significantly accelerate queries.
However, they can use much memory, and thus we might prefer compressed bitmap
indexes. Following Oracle's lead, bitmaps are often compressed using run-length
encoding (RLE). Building on prior work, we introduce the Roaring compressed
bitmap format: it uses packed arrays for compression instead of RLE. We compare
it to two high-performance RLE-based bitmap encoding techniques: WAH (Word
Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable
Integer Set). On synthetic and real data, we find that Roaring bitmaps (1)
often compress significantly better (e.g., 2 times) and (2) are faster than the
compressed alternatives (up to 900 times faster for intersections). Our results
challenge the view that RLE-based bitmap compression is best.
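The packed-array idea behind Roaring can be illustrated with a minimal sketch (hypothetical code, not the library's implementation): 32-bit integers are split into a 16-bit high part selecting a container and a 16-bit low part stored inside it, and intersections only visit containers whose keys match. The real format additionally switches each container between a sorted array and a 2^16-bit bitmap depending on its cardinality, which this toy omits.

```python
from collections import defaultdict

class TinyRoaring:
    """Toy sketch of Roaring's two-level layout (not the real format:
    containers here are plain Python sets, with no array/bitmap switch)."""

    def __init__(self):
        # high 16 bits of each value -> set of its low 16 bits
        self.containers = defaultdict(set)

    def add(self, x):
        self.containers[x >> 16].add(x & 0xFFFF)

    def __contains__(self, x):
        return (x & 0xFFFF) in self.containers.get(x >> 16, ())

    def intersect(self, other):
        # Only containers sharing a key can share values, so the
        # intersection skips whole 65536-value blocks at once.
        out = TinyRoaring()
        for key in self.containers.keys() & other.containers.keys():
            common = self.containers[key] & other.containers[key]
            if common:
                out.containers[key] = common
        return out

a, b = TinyRoaring(), TinyRoaring()
for x in (1, 2, 70000):
    a.add(x)
for x in (2, 70000, 99999):
    b.add(x)
both = a.intersect(b)
```

Skipping non-matching containers wholesale is one reason intersections can be so much faster than decoding two RLE streams in lockstep.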
Neural machine translation using bitmap fonts
Recently, translation systems based on neural networks have started to compete with phrase-based systems. Systems based on neural networks use vectorial representations of words. However, one of the biggest challenges machine translation still faces is dealing with large vocabularies and morphologically rich languages. This work adapts a neural machine translation system to translate from Chinese to Spanish using inputs of different granularities: words, characters, and bitmap fonts of Chinese characters or words. Interpreting every character or word as a bitmap font yields more informed vectorial representations. The best results are obtained when using the word bitmap font information.
Chinese–Spanish neural machine translation enhanced with character and word bitmap fonts
Recently, machine translation systems based on neural networks have reached state-of-the-art results for some language pairs (e.g., German–English). In this paper, we investigate the performance of neural machine translation on Chinese–Spanish, which is a challenging language pair. Given that the meaning of a Chinese word can be related to its graphical representation, this work aims to enhance neural machine translation by using as input a combination of words or characters and their corresponding bitmap fonts. Interpreting every word or character as a bitmap font generates more informed vectorial representations. The best results are obtained when using words plus their bitmap fonts, yielding an improvement of almost six BLEU points and five METEOR points over a competitive neural MT baseline, along with consistently better rankings in the human evaluation.
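The input representation described in the two abstracts above can be sketched as follows (a toy illustration with a hand-drawn stand-in glyph, not the papers' rendering pipeline, which rasterizes real character fonts): a character is rendered to a small binary bitmap, and the flattened bitmap becomes the vector fed to the model, so that visually similar characters get nearby vectors.

```python
import numpy as np

# Hand-drawn 5x5 stand-in for a rendered character glyph (1 = ink).
glyph = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
])

# Flatten to a 25-dimensional vector; this is the "bitmap font" input
# that replaces (or augments) a learned word/character embedding.
vector = glyph.flatten().astype(np.float32)
```

Real systems use much larger rasters and learn further transformations on top, but the key point survives in the toy: the vector is derived from the character's visual form rather than from its identity alone.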
Association Mining in Database Machine
Association rules are widely used in most data mining technologies. The Apriori algorithm is the fundamental association rule mining algorithm. The FP-growth tree algorithm improves performance by reducing the generation of frequent item sets. The Simplex algorithm is an advanced FP-growth variant that uses a bitmap structure together with the simplex concept from geometry. The bitmap structure implementation is specifically designed for storing data in database machines to support parallel computation of association rule mining.
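The role of bitmaps in association mining can be sketched with a toy example (hypothetical data, not the Simplex algorithm itself): each item maps to a bitmap with one bit per transaction, and the support of an itemset is the popcount of the AND of its items' bitmaps, an operation that database machines can parallelize naturally.

```python
# Toy transaction database: one set of items per transaction.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

def item_bitmap(item):
    """Bitmap with bit i set iff transaction i contains the item."""
    bits = 0
    for i, t in enumerate(transactions):
        if item in t:
            bits |= 1 << i
    return bits

def support(itemset):
    """Support = number of transactions containing every item,
    computed as the popcount of the AND of the item bitmaps."""
    bits = ~0
    for item in itemset:
        bits &= item_bitmap(item)
    bits &= (1 << len(transactions)) - 1  # mask to valid transactions
    return bin(bits).count("1")
```

For instance, `support({"a", "b"})` counts the transactions whose bits survive both ANDs, here transactions 0 and 2.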
Distinct counting with a self-learning bitmap
Counting the number of distinct elements (cardinality) in a dataset is a
fundamental problem in database management. In recent years, due to many of its
modern applications, there has been significant interest to address the
distinct counting problem in a data stream setting, where each incoming data
can be seen only once and cannot be stored for long periods of time. Many
probabilistic approaches based on either sampling or sketching have been
proposed in the computer science literature that require only limited
computing and memory resources. However, the performances of these methods are
not scale-invariant, in the sense that their relative root mean square
estimation errors (RRMSE) depend on the unknown cardinalities. This is not
desirable in many applications where cardinalities can be very dynamic or
inhomogeneous and many cardinalities need to be estimated. In this paper, we
develop a novel approach, called self-learning bitmap (S-bitmap) that is
scale-invariant for cardinalities in a specified range. S-bitmap uses a binary
vector whose entries are updated from 0 to 1 by an adaptive sampling process
for inferring the unknown cardinality, where the sampling rates are reduced
sequentially as more and more entries change from 0 to 1. We prove rigorously
that the S-bitmap estimate is not only unbiased but scale-invariant. We
demonstrate that, to achieve a given small RRMSE target, our approach
requires significantly less memory and performs a similar number of
operations or fewer than state-of-the-art methods at many cardinality
scales common in practice. Both simulation and experimental studies are reported.
Comment: Journal of the American Statistical Association (accepted)
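The adaptive-sampling mechanism described above can be sketched with a toy counter (a hypothetical simplification, not the paper's S-bitmap estimator, which derives its sampling rates and estimate from the bitmap fill count): each incoming distinct element is sampled with the current probability p, sampled elements mark a random bit, and p shrinks each time a previously unset bit fills. Crediting 1/p per sample keeps the running count unbiased, by a martingale argument.

```python
import random

def adaptive_bitmap_count(stream_size, m=1024, r=0.99, seed=42):
    """Toy adaptive-sampling counter (assumes the stream elements are
    already distinct; the real S-bitmap hashes to handle duplicates)."""
    rng = random.Random(seed)
    bits = [0] * m
    p = 1.0
    estimate = 0.0
    for _ in range(stream_size):
        q = p                      # rate is fixed before seeing the element
        if rng.random() < q:
            estimate += 1.0 / q    # unbiased credit for this sample
            i = rng.randrange(m)
            if bits[i] == 0:
                bits[i] = 1
                p *= r             # sample more sparingly as bits fill
    return estimate
```

Because p decreases as the bitmap fills, the counter spends its fixed memory budget evenly across a wide range of cardinalities, which is the intuition behind the scale-invariance claim.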
Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes
Bitmap indexes must be compressed to reduce input/output costs and minimize
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use
techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. These techniques are sensitive to the order of the rows: a
simple lexicographical sort can divide the index size by 9 and make indexes
several times faster. We investigate reordering heuristics based on computed
attribute-value histograms. Simply permuting the columns of the table based on
these histograms can increase the sorting efficiency by 40%.
Comment: To appear in proceedings of DOLAP 200
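Why row order matters for RLE-style compression can be shown with a toy table (hypothetical data, not the paper's benchmark): sorting the rows lexicographically groups equal attribute values, so each attribute-value bitmap has fewer, longer runs for a scheme like WAH to encode.

```python
def count_runs(bits):
    """Number of maximal runs of identical bits."""
    return 1 + sum(b != a for a, b in zip(bits, bits[1:])) if bits else 0

def bitmap_runs(rows):
    """Total run count over every attribute-value bitmap of the table."""
    total = 0
    for col in range(len(rows[0])):
        for value in set(r[col] for r in rows):
            bits = [1 if r[col] == value else 0 for r in rows]
            total += count_runs(bits)
    return total

rows = [("a", 1), ("b", 2), ("a", 2), ("b", 1), ("a", 1), ("b", 2)]
unsorted_runs = bitmap_runs(rows)
sorted_runs = bitmap_runs(sorted(rows))
```

On this tiny table the lexicographic sort cuts the total run count from 20 to 12; the histogram-aware heuristics in the paper go further by choosing which column to sort first.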