Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
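As a rough illustration of the idea above, the following sketch gap-encodes a posting list and then run-length encodes the resulting differences. It is a toy example under stated assumptions, not the authors' implementation: the function names `gaps`, `run_length`, and `decode` are hypothetical, document identifiers are assumed to start at 1, and the sample list is invented to mimic near-copy documents that all contain the same term.

```python
def gaps(postings):
    """Differential (gap) form of a strictly increasing posting list."""
    prev = 0  # assumption: document identifiers start at 1
    out = []
    for doc_id in postings:
        out.append(doc_id - prev)
        prev = doc_id
    return out


def run_length(values):
    """Collapse maximal runs of equal values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


def decode(runs):
    """Rebuild the original posting list from run-length encoded gaps."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Invented example: a term that occurs in long stretches of consecutive versions.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
encoded = run_length(gaps(postings))
```

When consecutive versions share a term, the gap list is dominated by runs of 1, which is the kind of regularity a run-length scheme can collapse.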
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection
operations is an active research topic. Most compression schemes rely on
encoding differences between consecutive positions with techniques that favor
small numbers. In this paper we explore a completely different alternative: We
use Re-Pair compression of those differences. While Re-Pair by itself offers
fast decompression at arbitrary positions in main and secondary memory, we
introduce variants that in addition speed up the operations required for
inverted list intersection. We compare the resulting data structures with
several recent proposals under various list intersection algorithms, to
conclude that our Re-Pair variants offer an interesting time/space tradeoff for
this problem, yet further improvements are required for it to improve upon the
state of the art.
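To make the Re-Pair idea concrete, here is a simplified sketch that repeatedly replaces the most frequent adjacent pair of gaps with a fresh symbol. It assumes a naive quadratic loop rather than the linear-time algorithm, and it omits the variants for fast intersection mentioned above; the names `repair` and `expand` and the sample gap list are hypothetical.

```python
from collections import Counter


def repair(seq):
    """Simplified Re-Pair: replace the most frequent adjacent pair until none repeats.

    Returns the compressed sequence and a dict mapping each new symbol
    to the pair it stands for. Pair counts include overlapping occurrences,
    so this is an approximation of the real algorithm's bookkeeping.
    """
    seq = list(seq)
    rules = {}
    next_symbol = max(seq, default=0) + 1  # new symbols never collide with gap values
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            # Scan left to right, replacing non-overlapping occurrences.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into the original gaps."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


# Invented gap list with a repeated pattern.
gap_list = [1, 2, 1, 2, 1, 2, 5]
compressed, rules = repair(gap_list)
restored = [g for s in compressed for g in expand(s, rules)]
```

Because each rule expands to a fixed sequence of gaps, a structure of this kind could store the sum of gaps per rule to skip over whole phrases during intersection; that extension is not shown here.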
From Query-By-Keyword to Query-By-Example: LinkedIn Talent Search Approach
One key challenge in talent search is to translate complex criteria of a
hiring position into a search query, while it is relatively easy for a searcher
to list examples of suitable candidates for a given position. To improve search
efficiency, we propose the next generation of talent search at LinkedIn, also
referred to as Search By Ideal Candidates. In this system, a searcher provides
one or several ideal candidates as the input to hire for a given position. The
system then generates a query based on the ideal candidates and uses it to
retrieve and rank results. Shifting from the traditional Query-By-Keyword to
this new Query-By-Example system poses a number of challenges: How to generate
a query that best describes the candidates? When moving to a completely
different paradigm, how does one leverage previous product logs to learn
ranking models and/or evaluate the new system with no existing usage logs?
Finally, given the different nature between the two search paradigms, the
ranking features typically used for Query-By-Keyword systems might not be
optimal for Query-By-Example. This paper describes our approach to solving
these challenges. We present experimental results confirming the effectiveness
of the proposed solution, particularly on query building and search ranking
tasks. As of writing this paper, the new system has been available to all
LinkedIn members.
Grammar compressed sequences with rank/select support
An early partial version of this paper appeared in Proc. SPIRE 2014: G. Navarro, A. Ordóñez, Grammar compressed sequences with rank/select support, Proc. 21st International Symposium on String Processing and Information Retrieval, LNCS, SPIRE, vol. 8799 (2014), pp. 31–44.
The final publication is available at Springer via http://dx.doi.org/10.1016/j.jda.2016.10.001
[Abstract] Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly repetitive sequences, and classical statistical compression proves ineffective. We introduce, instead, grammar-based representations for repetitive sequences, which use up to 6% of the space needed by statistically compressed representations, and support direct access and rank/select operations within tens of microseconds. We demonstrate the impact of our structures in text indexing applications.
Chile. Fondo Nacional de Desarrollo Científico y Tecnológico; 140796
Ministerio de Economía, Industria y Competitividad; 00645663/ITC-20133062
Ministerio de Economía, Industria y Competitividad; TIN2009-14560-C03-02
Ministerio de Economía, Industria y Competitividad; TIN2010-21246-C02-01
Ministerio de Economía, Industria y Competitividad; TIN2013-46238-C4-3-R
Ministerio de Economía, Industria y Competitividad; TIN2013-47090-C3-3-P
Ministerio de Economía, Industria y Competitividad; AP2010-6038
Xunta de Galicia; GRC2013/05
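For readers unfamiliar with the operations named in the abstract, the sketch below gives naive linear-time reference definitions of access, rank, and select over a plain sequence. It only pins down what the operations return; it says nothing about the grammar-based structures themselves. The indexing conventions (0-based positions, rank over a prefix, 1-based occurrence count for select) are assumptions, since conventions vary across the literature.

```python
def access(seq, i):
    """Return the symbol at 0-based position i."""
    return seq[i]


def rank(seq, symbol, i):
    """Number of occurrences of symbol in the prefix seq[0:i]."""
    return sum(1 for s in seq[:i] if s == symbol)


def select(seq, symbol, j):
    """0-based position of the j-th occurrence of symbol (j >= 1), or -1 if absent."""
    seen = 0
    for pos, s in enumerate(seq):
        if s == symbol:
            seen += 1
            if seen == j:
                return pos
    return -1


text = "abracadabra"
third_a = select(text, "a", 3)
```

Under these conventions, `rank(seq, c, select(seq, c, j) + 1) == j` holds whenever the j-th occurrence exists, which is a handy sanity check for any compressed implementation.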
Stochastic Query Covering for Fast Approximate Document Retrieval
We design algorithms that, given a collection of documents and a distribution over user queries, return a
small subset of the document collection in such a way that we can efficiently provide high-quality answers
to user queries using only the selected subset. This approach has applications when space is a constraint
or when the query-processing time increases significantly with the size of the collection. We study our
algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction
of the entire collection, they can provide answers to most user queries, achieving a performance close to the
optimal. To complement our theoretical findings, we experimentally show the versatility of our approach
by considering two important cases in the context of Web search. In the first case, we favor the retrieval of
documents that are relevant to the query, whereas in the second case we aim for document diversification.
Both the theoretical and the experimental analysis provide strong evidence of the potential value of query
covering in diverse application scenarios.
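One simple way to picture query covering is as weighted coverage: pick a few documents so that the queries they can answer carry as much probability mass as possible. The sketch below uses a generic greedy heuristic for that problem. It is an assumption for illustration only; the abstract does not describe the paper's algorithms, and the function name `greedy_cover`, the input shapes, and the sample data are all hypothetical.

```python
def greedy_cover(query_prob, answers, k):
    """Greedily choose up to k documents maximizing covered query probability.

    query_prob: dict mapping query -> probability of that query being issued
    answers:    dict mapping query -> set of documents that answer it well
    Returns the chosen documents (in pick order) and the covered probability mass.
    """
    chosen = []
    uncovered = set(query_prob)
    candidates = set().union(*answers.values())
    for _ in range(k):
        best_doc, best_gain = None, 0.0
        # Sorted iteration keeps tie-breaking deterministic.
        for doc in sorted(candidates - set(chosen)):
            gain = sum(query_prob[q] for q in uncovered if doc in answers.get(q, ()))
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        if best_doc is None:  # no remaining document covers a new query
            break
        chosen.append(best_doc)
        uncovered = {q for q in uncovered if best_doc not in answers.get(q, ())}
    covered_mass = sum(p for q, p in query_prob.items() if q not in uncovered)
    return chosen, covered_mass


# Invented toy distribution over three queries and three documents.
query_prob = {"q1": 0.5, "q2": 0.3, "q3": 0.2}
answers = {"q1": {"d1", "d2"}, "q2": {"d2"}, "q3": {"d3"}}
subset, mass = greedy_cover(query_prob, answers, k=1)
```

In this toy instance a single document already covers most of the query mass, which mirrors the abstract's claim that a small fraction of the collection can serve most queries.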