Lempel-Ziv Parsing in External Memory
For decades, computing the LZ factorization (or LZ77 parsing) of a string has
been a requisite and computationally intensive step in many diverse
applications, including text indexing and data compression. Many algorithms for
LZ77 parsing have been discovered over the years; however, despite the
increasing need to apply LZ77 to massive data sets, no algorithm to date scales
to inputs that exceed the size of internal memory. In this paper we describe
the first algorithm for computing the LZ77 parsing in external memory. Our
algorithm is fast in practice and will allow the next generation of text
indexes to be realised for massive strings and string collections.
Comment: 10 pages
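As a point of reference for the object being computed, the following is a minimal, in-memory sketch of LZ77 factorization using a naive quadratic scan; it illustrates the parsing only and is unrelated to the paper's external-memory algorithm. The factor representation (literal vs. back-reference) is an illustrative assumption.

```cpp
// Naive in-memory LZ77 factorization: each factor is either a fresh
// literal or a (distance, length) copy into the already-parsed prefix.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Factor {
    std::size_t dist;  // 0 for a literal
    std::size_t len;   // copy length (1 for a literal)
    char lit;          // the literal character when dist == 0
};

std::vector<Factor> lz77_parse(const std::string& s) {
    std::vector<Factor> factors;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t best_len = 0, best_dist = 0;
        // Find the longest match starting at any earlier position j < i
        // (matches may overlap the current position, as in classic LZ77).
        for (std::size_t j = 0; j < i; ++j) {
            std::size_t len = 0;
            while (i + len < s.size() && s[j + len] == s[i + len]) ++len;
            if (len > best_len) { best_len = len; best_dist = i - j; }
        }
        if (best_len == 0) {
            factors.push_back({0, 1, s[i]});
            i += 1;
        } else {
            factors.push_back({best_dist, best_len, '\0'});
            i += best_len;
        }
    }
    return factors;
}

int main() {
    for (const Factor& f : lz77_parse("abababbbab")) {
        if (f.dist == 0) std::cout << "lit '" << f.lit << "'\n";
        else std::cout << "copy dist=" << f.dist << " len=" << f.len << "\n";
    }
}
```

On the toy string the parse is two literals followed by three copies; the paper's contribution is producing exactly this kind of factorization when the input does not fit in internal memory.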
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, which have emerged as the de-facto choice for language modeling
in both academia and industry thanks to their relatively low perplexity.
Estimating such models from large textual sources poses the challenge of
devising algorithms that make parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory; we show
an improved construction that requires only one sorting step by exploiting the
properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.
Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2
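To make the indexing idea concrete, here is a hedged sketch of the context-based remapping described above: a word following a context of k preceding words is stored as its rank within the set of words observed after that context, so the stored integers are bounded by the (typically small) number of successors. The toy data, names, and map-based layout are illustrative assumptions, not the paper's compressed trie.

```cpp
// Context-based remapping sketch: store a word's rank among the
// successors of its context instead of its global vocabulary id.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Context = std::vector<std::string>;  // the k preceding words

int main() {
    // Toy (k = 1) data: context -> successors observed in the corpus.
    std::map<Context, std::vector<std::string>> successors = {
        {{"new"}, {"york", "jersey", "year"}},
        {{"the"}, {"cat", "dog"}},
    };

    // Per context, keep a sorted successor list; a word's code is its
    // position in that list, so codes lie in [0, #successors(context)).
    std::map<Context, std::vector<std::string>> codebook;
    for (auto& [ctx, words] : successors) {
        std::vector<std::string> sorted = words;
        std::sort(sorted.begin(), sorted.end());
        codebook[ctx] = sorted;
    }

    auto encode = [&](const Context& ctx, const std::string& w) {
        const auto& list = codebook.at(ctx);
        return static_cast<std::uint32_t>(
            std::lower_bound(list.begin(), list.end(), w) - list.begin());
    };

    std::cout << encode({"new"}, "york") << "\n";  // small rank, not a global id
    std::cout << encode({"the"}, "dog") << "\n";
}
```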
On inverted index compression for search engine efficiency
Efficient access to the inverted index data structure is a key aspect for a search engine to achieve fast response times to users’ queries. While the performance of an information retrieval (IR) system can be enhanced through the compression of its posting lists, there is little recent work in the literature that thoroughly compares and analyses the performance of modern integer compression schemes across different types of posting information (document ids, frequencies, positions). In this paper, we experiment with different modern integer compression algorithms, integrating them into a modern IR system. Through comprehensive experiments conducted on two large, widely used document corpora and large query sets, our results show how compressing each type of posting information benefits the space- and time-efficiency of the search engine. Overall, we find that the simple Frame of Reference compression scheme results in the best query response times for all types of posting information. Moreover, we observe that the frequency and position postings in Web corpora with large volumes of anchor text are more challenging to compress, yet compression is still beneficial in reducing average query response times.
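For concreteness, the following is a minimal sketch of Frame of Reference (FOR) coding applied to one block of document ids, following the textbook formulation (store the block minimum and fixed-width offsets); it is an assumption-level illustration, not the implementation evaluated in this paper, and the offsets are left unpacked for clarity.

```cpp
// Frame of Reference sketch for one posting-list block.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct ForBlock {
    std::uint32_t reference;              // block minimum, the "frame"
    std::uint32_t bits;                   // bits needed per offset
    std::vector<std::uint32_t> offsets;   // value - reference (unpacked here)
};

ForBlock for_encode(const std::vector<std::uint32_t>& block) {
    std::uint32_t lo = *std::min_element(block.begin(), block.end());
    std::uint32_t hi = *std::max_element(block.begin(), block.end());
    std::uint32_t bits = 0;
    while ((1ULL << bits) <= hi - lo) ++bits;  // width of the largest offset
    ForBlock out{lo, bits, {}};
    for (std::uint32_t v : block) out.offsets.push_back(v - lo);
    return out;
}

std::vector<std::uint32_t> for_decode(const ForBlock& b) {
    std::vector<std::uint32_t> out;
    for (std::uint32_t d : b.offsets) out.push_back(b.reference + d);
    return out;
}

int main() {
    std::vector<std::uint32_t> docids = {1000, 1003, 1012, 1015, 1030};
    ForBlock b = for_encode(docids);
    std::cout << "reference=" << b.reference << " bits=" << b.bits << "\n";
    for (std::uint32_t v : for_decode(b)) std::cout << v << ' ';
    std::cout << "\n";
}
```

Decoding is a single addition per value with no branches, which is one reason such schemes tend to yield fast query response times.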
Decoding billions of integers per second through vectorization
In many important applications -- such as search engines and relational
database systems -- data is stored in the form of arrays of integers. Encoding
and, most importantly, decoding of these arrays consumes considerable CPU time.
Therefore, substantial effort has been made to reduce costs associated with
compression and decompression. In particular, researchers have exploited the
superscalar nature of modern processors and SIMD instructions. Nevertheless, we
introduce a novel vectorized scheme called SIMD-BP128 that improves over
previously proposed vectorized approaches. It is nearly twice as fast as the
previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the
same time, SIMD-BP128 saves up to 2 bits per integer. For even better
compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has
a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while
being two times faster during decoding.
Comment: For software, see https://github.com/lemire/FastPFor; for data, see http://boytsov.info/datasets/clueweb09gap
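The binary-packing layout that schemes such as SIMD-BP128 vectorize can be shown with a portable scalar sketch: a block of 128 integers shares one bit width, chosen to fit the block's maximum, and every value is packed into that many bits. The code below illustrates only the layout and round-trip; the actual scheme packs and unpacks four integers at a time in 128-bit SIMD registers and organizes blocks differently, so treat the details as assumptions.

```cpp
// Scalar bit-packing sketch: fixed bit width per 128-integer block.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Pack each value of `in` into `bits` bits, little-endian within 64-bit words.
std::vector<std::uint64_t> pack(const std::vector<std::uint32_t>& in, std::uint32_t bits) {
    std::vector<std::uint64_t> out((in.size() * bits + 63) / 64, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::size_t pos = i * bits;
        out[pos / 64] |= static_cast<std::uint64_t>(in[i]) << (pos % 64);
        if (pos % 64 + bits > 64)  // value straddles a word boundary
            out[pos / 64 + 1] |= static_cast<std::uint64_t>(in[i]) >> (64 - pos % 64);
    }
    return out;
}

std::vector<std::uint32_t> unpack(const std::vector<std::uint64_t>& in,
                                  std::size_t n, std::uint32_t bits) {
    std::vector<std::uint32_t> out(n);
    std::uint64_t mask = (1ULL << bits) - 1;  // assumes bits < 64
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t pos = i * bits;
        std::uint64_t v = in[pos / 64] >> (pos % 64);
        if (pos % 64 + bits > 64)
            v |= in[pos / 64 + 1] << (64 - pos % 64);
        out[i] = static_cast<std::uint32_t>(v & mask);
    }
    return out;
}

int main() {
    std::vector<std::uint32_t> block(128);
    for (std::size_t i = 0; i < block.size(); ++i) block[i] = (i * 37) % 1000;  // fits in 10 bits
    std::vector<std::uint64_t> packed = pack(block, 10);
    bool ok = (unpack(packed, block.size(), 10) == block);
    std::cout << (ok ? "roundtrip ok, " : "mismatch, ")
              << packed.size() << " x 64-bit words vs 128 x 32-bit input\n";
}
```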