72,671 research outputs found
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection
operations is an active research topic. Most compression schemes rely on
encoding differences between consecutive positions with techniques that favor
small numbers. In this paper we explore a completely different alternative: We
use Re-Pair compression of those differences. While Re-Pair by itself offers
fast decompression at arbitrary positions in main and secondary memory, we
introduce variants that in addition speed up the operations required for
inverted list intersection. We compare the resulting data structures with
several recent proposals under various list intersection algorithms, to
conclude that our Re-Pair variants offer an interesting time/space tradeoff for
this problem, yet further improvements are required for it to improve upon the
state of the art
Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared
Towards a compact representation of temporal rasters
Big research efforts have been devoted to efficiently manage spatio-temporal
data. However, most works focused on vectorial data, and much less, on raster
data. This work presents a new representation for raster data that evolve along
time named Temporal k^2 raster. It faces the two main issues that arise when
dealing with spatio-temporal data: the space consumption and the query response
times. It extends a compact data structure for raster data in order to manage
time and thus, it is possible to query it directly in compressed form, instead
of the classical approach that requires a complete decompression before any
manipulation. In addition, in the same compressed space, the new data structure
includes two indexes: a spatial index and an index on the values of the cells,
thus becoming a self-index for raster data.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sklodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. Published in SPIRE 201
Efficient and Effective Query Auto-Completion
Query Auto-Completion (QAC) is an ubiquitous feature of modern textual search
systems, suggesting possible ways of completing the query being typed by the
user. Efficiency is crucial to make the system have a real-time responsiveness
when operating in the million-scale search space. Prior work has extensively
advocated the use of a trie data structure for fast prefix-search operations in
compact space. However, searching by prefix has little discovery power in that
only completions that are prefixed by the query are returned. This may impact
negatively the effectiveness of the QAC system, with a consequent monetary loss
for real applications like Web Search Engines and eCommerce. In this work we
describe the implementation that empowers a new QAC system at eBay, and discuss
its efficiency/effectiveness in relation to other approaches at the
state-of-the-art. The solution is based on the combination of an inverted index
with succinct data structures, a much less explored direction in the
literature. This system is replacing the previous implementation based on
Apache SOLR that was not always able to meet the required
service-level-agreement.Comment: Published in SIGIR 202
Decoding billions of integers per second through vectorization
In many important applications -- such as search engines and relational
database systems -- data is stored in the form of arrays of integers. Encoding
and, most importantly, decoding of these arrays consumes considerable CPU time.
Therefore, substantial effort has been made to reduce costs associated with
compression and decompression. In particular, researchers have exploited the
superscalar nature of modern processors and SIMD instructions. Nevertheless, we
introduce a novel vectorized scheme called SIMD-BP128 that improves over
previously proposed vectorized approaches. It is nearly twice as fast as the
previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the
same time, SIMD-BP128 saves up to 2 bits per integer. For even better
compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has
a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while
being two times faster during decoding.Comment: For software, see https://github.com/lemire/FastPFor, For data, see
http://boytsov.info/datasets/clueweb09gap
Tree Compression with Top Trees Revisited
We revisit tree compression with top trees (Bille et al, ICALP'13) and
present several improvements to the compressor and its analysis. By
significantly reducing the amount of information stored and guiding the
compression step using a RePair-inspired heuristic, we obtain a fast compressor
achieving good compression ratios, addressing an open problem posed by Bille et
al. We show how, with relatively small overhead, the compressed file can be
converted into an in-memory representation that supports basic navigation
operations in worst-case logarithmic time without decompression. We also show a
much improved worst-case bound on the size of the output of top-tree
compression (answering an open question posed in a talk on this algorithm by
Weimann in 2012).Comment: SEA 201
- …