
    Universal Indexes for Highly Repetitive Document Collections

    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor even that a clear versioning structure exists. We also include compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.
    Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
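    To make the contrast with gap encoding concrete, here is a minimal Python sketch (not the paper's implementation; the data and function names are illustrative) of why run-length coding of the differential inverted list pays off when a collection consists of near-copies: long stretches of consecutive document IDs collapse into a handful of runs.

```python
def gaps(postings):
    """Differential (gap) representation of a strictly increasing posting list."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(gap_list):
    """Collapse maximal runs of equal gaps into (gap, run_length) pairs."""
    runs = []
    for g in gap_list:
        if runs and runs[-1][0] == g:
            runs[-1] = (g, runs[-1][1] + 1)
        else:
            runs.append((g, 1))
    return runs

# Versioned collection: the term occurs in documents 4..1003 and 2000..2499,
# i.e. two blocks of near-identical versions.
postings = list(range(4, 1004)) + list(range(2000, 2500))
g = gaps(postings)     # 1500 gaps, almost all equal to 1
r = run_length(g)      # only 4 runs: (4,1), (1,999), (997,1), (1,499)
print(len(g), len(r))  # 1500 4
```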

    Efficient and Effective Query Auto-Completion

    Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial for real-time responsiveness when operating over a million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power, in that only completions prefixed by the query are returned. This may negatively impact the effectiveness of the QAC system, with a consequent monetary loss for real applications such as Web search engines and eCommerce. In this work we describe the implementation that powers a new QAC system at eBay, and discuss its efficiency and effectiveness in relation to other state-of-the-art approaches. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache Solr, which was not always able to meet the required service-level agreement.
    Comment: Published in SIGIR 202
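    As a toy illustration of the difference in discovery power (this is not eBay's system; the completion strings and function names are invented), a term-level inverted index can suggest completions in which the typed word appears anywhere, whereas prefix search only returns completions that start with the query.

```python
from collections import defaultdict

completions = ["iphone 12 case", "case for iphone", "phone charger", "iphone charger"]

# Inverted index: term -> set of completion IDs containing that term.
index = defaultdict(set)
for cid, text in enumerate(completions):
    for term in text.split():
        index[term].add(cid)

def prefix_search(query):
    """Trie-style behaviour: only completions prefixed by the query."""
    return [c for c in completions if c.startswith(query)]

def term_search(query):
    """Inverted-index behaviour: completions containing every typed term anywhere."""
    sets = [index[t] for t in query.split() if t in index]
    if not sets:
        return []
    ids = set.intersection(*sets)
    return [completions[i] for i in sorted(ids)]

print(prefix_search("iphone"))  # misses "case for iphone"
print(term_search("iphone"))    # also retrieves "case for iphone"
```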

    Integer Set Compression and Statistical Modeling

    Compression of integer sets and sequences has been extensively studied for settings where elements follow a uniform probability distribution. In addition, methods exist that exploit clustering of elements in order to achieve higher compression performance. In this work, we address the case where the enumeration of elements may be arbitrary or random, but where statistics are kept in order to estimate the probabilities of elements. We present a recursive subset-size encoding method that is able to benefit from these statistics, explore the effects of permuting the enumeration order based on element probabilities, and discuss general properties of and possibilities for this class of compression problem.
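    One plausible reading of a recursive subset-size decomposition is sketched below (an assumption on our part, not necessarily the paper's exact method): a subset of a universe [0, u) is described by recursively halving the universe and recording how many elements fall in the left half; these counts are the values an entropy coder driven by element statistics could then compress.

```python
def subset_sizes(elements, lo, hi):
    """Sequence of left-half counts describing `elements` within the range [lo, hi)."""
    if hi - lo <= 1 or not elements:
        return []
    mid = (lo + hi) // 2
    left = [x for x in elements if x < mid]
    right = [x for x in elements if x >= mid]
    out = [len(left)]                     # the "subset size" recorded at this node
    out += subset_sizes(left, lo, mid)    # recurse into the left half
    out += subset_sizes(right, mid, hi)   # recurse into the right half
    return out

# Given the total set size and the universe, the counts fully describe the set.
print(subset_sizes([1, 2, 7, 12], 0, 16))
```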

    The Many Qualities of a New Directly Accessible Compression Scheme

    We present a new variable-length computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a flexible and simple representation geared towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th Fibonacci number. Overall it uses $N + \mathcal{O}\big(n(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1})\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme qualifies as a \emph{computation-friendly compression} scheme, as it has several features that make it very effective in text processing tasks. In the string matching problem, which we take as a case study, we experimentally show that the new scheme enables searches up to 29 times faster than standard string-matching techniques on plain texts.
    Comment: 33 page
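    For context, the sketch below illustrates the DAC (Directly Addressable Codes) baseline mentioned in the abstract rather than SFDC itself (the chunk width B and all names are illustrative): integers are split into fixed-width chunks stored level by level, and a per-chunk continuation bit plus a rank query lets any element be accessed without decoding its predecessors. A real DAC replaces the naive rank below with a succinct rank structure.

```python
B = 2  # chunk width in bits

def dac_encode(values):
    """Split each value into B-bit chunks, stored level by level with continue bits."""
    levels = []
    todo = [(i, v) for i, v in enumerate(values)]
    while todo:
        chunks, cont, nxt = [], [], []
        for pos, v in todo:
            chunks.append(v & ((1 << B) - 1))
            rest = v >> B
            cont.append(1 if rest > 0 else 0)
            if rest > 0:
                nxt.append((pos, rest))
        levels.append((chunks, cont))
        todo = nxt
    return levels

def dac_access(levels, i):
    """Recover the i-th value by walking levels, using rank on the continue bits."""
    value, shift = 0, 0
    for chunks, cont in levels:
        value |= chunks[i] << shift
        if not cont[i]:
            break
        shift += B
        i = sum(cont[:i])  # rank: this element's position in the next level
    return value

vals = [3, 17, 0, 42, 5]
enc = dac_encode(vals)
print([dac_access(enc, i) for i in range(len(vals))])  # [3, 17, 0, 42, 5]
```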

    Image compression using noncausal prediction

    Image compression is commonly achieved using prediction of the value of pixels from surrounding pixels. Normally the choice of pixels used in the prediction is restricted to previously scanned pixels. A better prediction can be achieved if pixels on all sides of the pixel to be predicted are used. A prediction and decoding method is proposed that is independent of the scanning order of the image. The decoding process makes use of an iterative decoder: a sequence of images is generated that converges to a final image identical to the original image. The theory underlying noncausal prediction and iterative decoding is developed. Convergence properties of the decoding algorithm are studied and conditions for convergence are presented. Distortions to the prediction residual after encoding can be caused by storage requirements, such as quantization and compression, and also by errors in transmission. Effects of distortions of the residual on the final decoded image are investigated by introducing several types of distortion of the residual, including (1) alteration of randomly selected bits in the residual, (2) addition of a sinusoidal signal to the residual, (3) quantization of the residual, and (4) compression of the residual using lossy Haar wavelet coding. The resulting distortion in the decoded images was generally smaller for noncausal prediction than for causal prediction, both in terms of PSNR and visual quality. Most noticeably, the streaks found in the decoded image after causal encoding were absent with noncausal encoding.
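    A minimal numpy sketch of the general idea (not the paper's exact predictor or decoder; the 4-neighbour mean and zero padding are assumptions): the encoder stores the residual of a noncausal prediction, and the decoder recovers the image as the fixed point of x <- residual + predict(x), which converges because the zero-padded averaging operator has spectral radius below one.

```python
import numpy as np

def predict(img):
    """Noncausal prediction: mean of the 4-neighbourhood, zero-padded at borders."""
    p = np.zeros_like(img, dtype=float)
    p[1:, :] += img[:-1, :]   # north neighbour
    p[:-1, :] += img[1:, :]   # south neighbour
    p[:, 1:] += img[:, :-1]   # west neighbour
    p[:, :-1] += img[:, 1:]   # east neighbour
    return p / 4.0

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(16, 16)).astype(float)

residual = original - predict(original)   # what the encoder would store

decoded = np.zeros_like(original)         # iterative decoder, any starting image works
for _ in range(2000):
    decoded = residual + predict(decoded)

print(np.max(np.abs(decoded - original)))  # shrinks towards 0 as iterations increase
```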

    Scalable processing and autocovariance computation of big functional data

    This is the peer-reviewed version of the following article: Brisaboa NR, Cao R, Paramá JR, Silva-Coira F. Scalable processing and autocovariance computation of big functional data. Softw Pract Exper. 2018; 48: 123–140, published in final form at https://doi.org/10.1002/spe.2524.
    [Abstract]: This paper presents two main contributions. The first is a compact representation of huge sets of functional data, or trajectories of continuous-time stochastic processes, which keeps the data compressed even while they are processed in main memory. It is designed to support efficient computation of the sample autocovariance function without previously decompressing the data set, using only partial local decoding. The second contribution is a new memory-efficient algorithm to compute the sample autocovariance function. In our experiments, the combination of the compact representation and the new algorithm brought the following benefits: the compressed data occupy on disk 75% of the space needed by the original data, and the computation of the autocovariance function used up to 13 times less main memory and ran 65% faster than the classical method implemented, for example, in the R package.
    This work was supported by the Ministerio de Economía y Competitividad (PGE and FEDER) under grants TIN2016-78011-C4-1-R, MTM2014-52876-R and TIN2013-46238-C4-3-R; Centro para el Desarrollo Tecnológico e Industrial MINECO under grants IDI-20141259, ITC-20151247, ITC-20151305 and ITC-20161074; Xunta de Galicia (co-funded with FEDER) under Grupos de Referencia Competitiva grant ED431C-2016-015; and Xunta de Galicia-Consellería de Cultura, Educación e Ordenación Universitaria (co-funded with FEDER) under Redes grants R2014/041 and ED341D R2016/045, and under Centro Singular de Investigación de Galicia grant ED431G/01.
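    As a hedged sketch of the quantity involved (this is not the paper's compressed representation or memory-efficient algorithm; it only shows a one-curve-at-a-time accumulation), the sample autocovariance surface of n discretised curves can be computed while holding just one curve and an m x m accumulator in memory, instead of the whole n x m data set.

```python
import numpy as np

def autocovariance(curves):
    """curves: iterable yielding 1-D arrays of length m (one functional datum each)."""
    n, total, outer = 0, None, None
    for x in curves:
        x = np.asarray(x, dtype=float)
        if total is None:
            total = np.zeros_like(x)
            outer = np.zeros((x.size, x.size))
        n += 1
        total += x
        outer += np.outer(x, x)
    mean = total / n
    # C(s, t) = E[X(s) X(t)] - mean(s) mean(t), the biased sample autocovariance.
    return outer / n - np.outer(mean, mean)

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 20))          # 100 curves sampled at 20 points
C = autocovariance(iter(data))
print(np.allclose(C, np.cov(data, rowvar=False, bias=True)))  # True
```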

    Compressed Indexes for String Searching in Labeled Graphs

    Storing and searching large labeled graphs is becoming a key issue in the design of space/time efficient online platforms that index modern social networks or knowledge graphs. However, as far as we know, existing results are limited to compressed graph indexes that support basic access operations on the link structure of the input graph, such as: given a node u, return the adjacency list of u. This paper takes inspiration from Facebook's Unicorn platform and proposes compressed indexing schemes for large graphs whose nodes are labeled with strings of variable length, i.e., node attributes such as a user's (nick)name, that support sophisticated search operations involving both the linked structure of the graph and the string content of its nodes. An extensive experimental evaluation over real social networks shows the time and space efficiency of the proposed indexing schemes and their query processing algorithms.
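    A toy sketch of the kind of combined query such indexes target (plain Python dictionaries and bisect, not the paper's compressed schemes; all data are invented): given a node u and a string prefix, return u's neighbours whose label starts with that prefix, here by binary search over an adjacency list pre-sorted by neighbour label.

```python
import bisect

labels = {0: "alice", 1: "alan", 2: "bob", 3: "albert", 4: "carol"}
adj = {0: [1, 2, 3, 4]}

# Pre-sort each adjacency list by neighbour label.
sorted_adj = {u: sorted(vs, key=lambda v: labels[v]) for u, vs in adj.items()}

def neighbours_with_prefix(u, prefix):
    """Neighbours of u whose label starts with `prefix`, via binary search."""
    vs = sorted_adj[u]
    keys = [labels[v] for v in vs]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # end of the prefix range
    return [(v, labels[v]) for v in vs[lo:hi]]

print(neighbours_with_prefix(0, "al"))  # [(1, 'alan'), (3, 'albert')]
```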

    Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions

    Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast to O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multiprocessor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data
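    A minimal numpy sketch of the two-stage scheme the survey describes (the oversampling and power-iteration defaults below are illustrative choices, not prescribed values): random sampling identifies a subspace that captures most of the action of A, A is compressed onto that subspace, and a small deterministic SVD of the reduced matrix yields the low-rank factorization.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Stage A: sample the range of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    for _ in range(n_iter):          # power iterations sharpen the captured subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)           # orthonormal basis for the sampled range
    # Stage B: compress A onto the subspace and factor the small matrix.
    B = Q.T @ A                      # (k + oversample) x n
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))  # rank ~40
U, s, Vt = randomized_svd(A, k=40, seed=0)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))          # near 0
```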