39,914 research outputs found

    Document Retrieval on Repetitive Collections

    Full text link
    Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at http://www.cs.helsinki.fi/group/suds/rlcsa

    A Compression Technique Exploiting References for Data Synchronization Services

    Get PDF
    Department of Computer Science and EngineeringIn a variety of network applications, there exists significant amount of shared data between two end hosts. Examples include data synchronization services that replicate data from one node to another. Given that shared data may have high correlation with new data to transmit, we question how such shared data can be best utilized to improve the efficiency of data transmission. To answer this, we develop an encoding technique, SyncCoding, that effectively replaces bit sequences of the data to be transmitted with the pointers to their matching bit sequences in the shared data so called references. By doing so, SyncCoding can reduce data traffic, speed up data transmission, and save energy consumption for transmission. Our evaluations of SyncCoding implemented in Linux show that it outperforms existing popular encoding techniques, Brotli, LZMA, Deflate, and Deduplication. The gains of SyncCoding over those techniques in the perspective of data size after compression in a cloud storage scenario are about 12.4%, 20.1%, 29.9%, and 61.2%, and are about 78.3%, 79.6%, 86.1%, and 92.9% in a web browsing scenario, respectively.ope

    From Theory to Practice: Plug and Play with Succinct Data Structures

    Full text link
    Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to- use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is a difficult task, since older base- line implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.Comment: 10 pages, 4 figures, 3 table
    corecore