26,373 research outputs found

    Better bitmap performance with Roaring bitmaps

    Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use a lot of memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best.
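
    The "packed arrays instead of RLE" idea can be illustrated with a toy container scheme. The sketch below is not the Roaring library API: the class name `ToyRoaring`, the split on the high 16 bits, and the 4096-element threshold are assumptions taken from the commonly described layout of the format, not from the abstract above. Sparse chunks are kept as sorted arrays of 16-bit values; dense chunks switch to a bitset.

```python
from bisect import bisect_left

# Assumed threshold: beyond 4096 entries, a 65,536-bit bitset is no larger
# than an array of 16-bit values.
ARRAY_LIMIT = 4096


class ToyRoaring:
    """Illustrative two-level integer set; hypothetical, not the real library."""

    def __init__(self):
        # high 16 bits -> sorted list (sparse chunk) or int used as a bitset (dense chunk)
        self.containers = {}

    def add(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        c = self.containers.get(hi)
        if c is None:
            self.containers[hi] = [lo]
        elif isinstance(c, list):
            i = bisect_left(c, lo)
            if i == len(c) or c[i] != lo:
                c.insert(i, lo)
                if len(c) > ARRAY_LIMIT:
                    # Chunk became dense: convert the array into a bitset.
                    bits = 0
                    for v in c:
                        bits |= 1 << v
                    self.containers[hi] = bits
        else:
            self.containers[hi] = c | (1 << lo)

    def __contains__(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        c = self.containers.get(hi)
        if c is None:
            return False
        if isinstance(c, list):
            i = bisect_left(c, lo)
            return i < len(c) and c[i] == lo
        return (c >> lo) & 1 == 1

    def __len__(self):
        return sum(
            len(c) if isinstance(c, list) else bin(c).count("1")
            for c in self.containers.values()
        )
```

    Membership tests stay cheap in both representations: a binary search in the sparse case and a single bit test in the dense case.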

    Neural machine translation using bitmap fonts

    Recently, translation systems based on neural networks have started to compete with phrase-based systems. Neural systems use vector representations of words. However, one of the biggest challenges that machine translation still faces is dealing with large vocabularies and morphologically rich languages. This work adapts a neural machine translation system to translate from Chinese to Spanish, using input at different granularities: words, characters, and bitmap fonts of Chinese characters or words. Interpreting every character or word as a bitmap font yields more informed vector representations. The best results are obtained when using the information of the word bitmap font. Postprint (published version).

    Chinese–Spanish neural machine translation enhanced with character and word bitmap fonts

    Recently, machine translation systems based on neural networks have reached state-of-the-art results for some pairs of languages (e.g., German–English). In this paper, we investigate the performance of neural machine translation for Chinese–Spanish, which is a challenging language pair. Given that the meaning of a Chinese word can be related to its graphical representation, this work aims to enhance neural machine translation by using as input a combination of words or characters and their corresponding bitmap fonts. Interpreting every word or character as a bitmap font generates more informed vector representations. The best results are obtained when using words plus their bitmap fonts, which yields an improvement (over a competitive neural MT baseline system) of almost six BLEU points and five METEOR points, and is ranked coherently better in the human evaluation. Peer reviewed. Postprint (published version).

    Association Mining in Database Machine

    Association rules are widely used in most data mining technologies. The Apriori algorithm is the fundamental association rule mining algorithm. The FP-growth tree algorithm improves performance by reducing the generation of the frequent item sets. The Simplex algorithm is an advanced FP-growth algorithm that uses a bitmap structure together with the simplex concept from geometry. The bitmap structure implementation is specifically designed for storing the data in database machines, to support parallel computation of association rule mining.
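
    To make the role of a bitmap structure in association mining concrete, here is a minimal sketch of generic vertical-bitmap support counting. It is not the Simplex algorithm from the abstract, whose details are not given here; the function names `item_bitmaps` and `support` are hypothetical. Each item gets one bitmap with a bit per transaction, and the support of an item set is the number of set bits in the AND of its items' bitmaps.

```python
from functools import reduce


def item_bitmaps(transactions):
    """Map each item to an int whose bit r is set when transaction r contains it."""
    bitmaps = {}
    for row, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << row)
    return bitmaps


def support(bitmaps, itemset):
    """Count transactions containing every item of a non-empty item set."""
    combined = reduce(lambda a, b: a & b, (bitmaps.get(i, 0) for i in itemset))
    return bin(combined).count("1")


transactions = [
    {"bread", "milk"},
    {"bread", "beer"},
    {"milk", "beer", "bread"},
    {"milk"},
]
bitmaps = item_bitmaps(transactions)
print(support(bitmaps, ["bread", "milk"]))
```

    Because each bitmap is independent, the AND operations for different candidate item sets can be evaluated in parallel, which is the property the abstract appeals to.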

    Distinct counting with a self-learning bitmap

    Counting the number of distinct elements (cardinality) in a dataset is a fundamental problem in database management. In recent years, due to many of its modern applications, there has been significant interest in addressing the distinct counting problem in a data stream setting, where each incoming data item can be seen only once and cannot be stored for long periods of time. Many probabilistic approaches based on either sampling or sketching have been proposed in the computer science literature that require only limited computing and memory resources. However, the performance of these methods is not scale-invariant, in the sense that their relative root mean square estimation errors (RRMSE) depend on the unknown cardinalities. This is not desirable in many applications where cardinalities can be very dynamic or inhomogeneous and many cardinalities need to be estimated. In this paper, we develop a novel approach, called self-learning bitmap (S-bitmap), that is scale-invariant for cardinalities in a specified range. S-bitmap uses a binary vector whose entries are updated from 0 to 1 by an adaptive sampling process for inferring the unknown cardinality, where the sampling rates are reduced sequentially as more and more entries change from 0 to 1. We prove rigorously that the S-bitmap estimate is not only unbiased but scale-invariant. We demonstrate that to achieve a small RRMSE value of $\epsilon$ or less, our approach requires significantly less memory and consumes similar or fewer operations than state-of-the-art methods for many common practice cardinality scales. Both simulation and experimental studies are reported. Comment: Journal of the American Statistical Association (accepted)
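
    The update rule described above (bits flip from 0 to 1 under a sampling rate that shrinks as the bitmap fills) can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the class name `ToySBitmap` is hypothetical, the rate schedule is supplied by the caller rather than derived as in the paper, and the estimator used here (the expected number of distinct items needed to set the observed number of bits) is an assumption about its form.

```python
import hashlib


class ToySBitmap:
    """Toy adaptive-sampling bitmap; rate schedule and estimator are assumptions."""

    def __init__(self, m, rates):
        # rates[k] is the sampling rate in force while k bits are set.
        # A non-increasing schedule makes repeated items harmless.
        assert len(rates) == m
        assert all(a >= b > 0 for a, b in zip(rates, rates[1:]))
        self.m = m
        self.rates = rates
        self.bits = [False] * m
        self.filled = 0

    def add(self, item):
        # Derive a bucket and a uniform value in [0, 1) from one hash, so the
        # same item always produces the same pair.
        digest = hashlib.sha256(repr(item).encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.m
        u = int.from_bytes(digest[8:16], "big") / 2**64
        if (
            self.filled < self.m
            and not self.bits[bucket]
            and u < self.rates[self.filled]
        ):
            self.bits[bucket] = True
            self.filled += 1

    def estimate(self):
        # With k bits set, a new distinct item sets another bit with
        # probability (1 - k/m) * rates[k]; sum the expected waiting times.
        return sum(
            1.0 / ((1 - k / self.m) * self.rates[k]) for k in range(self.filled)
        )
```

    With a constant rate of 1.0 the sketch performs no sampling and behaves like a plain hashed bitmap; a decreasing schedule trades accuracy at small cardinalities for range at large ones.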

    Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes

    Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate reordering heuristics based on computed attribute-value histograms. Simply permuting the columns of the table based on these histograms can increase the sorting efficiency by 40%. Comment: To appear in proceedings of DOLAP 200
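
    The sensitivity to row order can be seen on a tiny table. The sketch below does not implement WAH; it only counts runs of identical bits per bitmap as a rough proxy for what an RLE-based scheme would have to store. The helper names `bitmap_index` and `total_runs` and the sample rows are illustrative assumptions.

```python
from itertools import groupby


def bitmap_index(rows):
    """Build one bitmap (list of 0/1) per (column, value) pair."""
    index = {}
    for r, row in enumerate(rows):
        for col, value in enumerate(row):
            index.setdefault((col, value), [0] * len(rows))[r] = 1
    return index


def total_runs(rows):
    """Total number of runs of identical bits over all bitmaps of the index."""
    return sum(len(list(groupby(bits))) for bits in bitmap_index(rows).values())


rows = [("b", "x"), ("a", "y"), ("b", "y"), ("a", "x"), ("b", "x"), ("a", "y")]
print(total_runs(rows), total_runs(sorted(rows)))
```

    Sorting groups equal values of the first column together, so its bitmaps collapse into a few long runs; later columns benefit less, which is why the choice of column order matters.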