4,246 research outputs found

Universal Indexes for Highly Repetitive Document Collections

Author: Claude Francisco
Fariña Antonio
Martínez-Prieto Miguel A.
Navarro Gonzalo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.

Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
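To make the idea in the abstract above concrete, here is a minimal sketch of a differential (gap-encoded) inverted list followed by run-length compression of the gaps. It is an illustration only, not the authors' implementation: the function names and the sample posting list are hypothetical, and the assumption is that a term occurring in many consecutive versions produces long runs of gap 1.

```python
from itertools import groupby

def to_gaps(postings):
    """Differential (d-gap) form of a strictly increasing list of document IDs."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def run_length(gaps):
    """Collapse each run of equal gaps into a (gap, count) pair."""
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]

def from_runs(runs):
    """Rebuild the original posting list from (gap, count) pairs."""
    postings, current = [], 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings

# Hypothetical posting list: the term appears in versions 3-8 and 20-23.
postings = [3, 4, 5, 6, 7, 8, 20, 21, 22, 23]
gaps = to_gaps(postings)
runs = run_length(gaps)
print(gaps)  # [3, 1, 1, 1, 1, 1, 12, 1, 1, 1]
print(runs)  # [(3, 1), (1, 5), (12, 1), (1, 3)]
```

In this toy case ten gaps shrink to four pairs. Real indexes would additionally encode the pairs compactly and support skipping, which this sketch omits.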

arXiv.org e-Print Archive

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositorio Académico de la Universidad de Chile

Re-Pair Compression of Inverted Lists

Author: Claude Francisco
Fariña Antonio
Navarro Gonzalo
Publication date: 01/01/2009

Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: we use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, and conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, yet further improvements are required for them to improve upon the state of the art.
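As a rough illustration of applying Re-Pair to a sequence of differences, the sketch below repeatedly replaces the most frequent adjacent pair with a fresh symbol and records a grammar rule for it. This is a naive quadratic-time version written for clarity, not the variants described in the paper; the function names and the sample gap sequence are hypothetical.

```python
from collections import Counter

def repair(seq):
    """Naive Re-Pair: replace the most frequent adjacent pair until none repeats.

    Pair counts include overlapping occurrences, so a rule may occasionally be
    created for a pair that is replaced only once; the result stays lossless.
    """
    seq = list(seq)
    rules = {}
    next_symbol = max(seq, default=0) + 1  # fresh symbols never collide with gaps
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)  # left-to-right, non-overlapping replacement
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a symbol back into the original gaps."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)

# Hypothetical gap sequence of an inverted list.
gaps = [1, 2, 1, 2, 1, 2, 5, 1, 2]
compressed, rules = repair(gaps)
decoded = [g for s in compressed for g in expand(s, rules)]
print(len(gaps), len(compressed), decoded == gaps)
```

Because each grammar symbol expands to a fixed sub-sequence of gaps, a symbol can also store the sum of its gaps so that whole phrases can be skipped during intersection; that refinement is left out here.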

arXiv.org e-Print Archive

CiteSeerX

Repositorio Académico de la Universidad de Chile

From Query-By-Keyword to Query-By-Example: LinkedIn Talent Search Approach

Author: Dialani Vijay
Gupta Abhishek
Ha-Thuc Viet
Sinha Shakti
Wu Xianren
Yan Yan
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/09/2017

One key challenge in talent search is to translate complex criteria of a hiring position into a search query, while it is relatively easy for a searcher to list examples of suitable candidates for a given position. To improve search efficiency, we propose the next generation of talent search at LinkedIn, also referred to as Search By Ideal Candidates. In this system, a searcher provides one or several ideal candidates as the input to hire for a given position. The system then generates a query based on the ideal candidates and uses it to retrieve and rank results. Shifting from the traditional Query-By-Keyword to this new Query-By-Example system poses a number of challenges: How to generate a query that best describes the candidates? When moving to a completely different paradigm, how does one leverage previous product logs to learn ranking models and/or evaluate the new system with no existing usage logs? Finally, given the different nature of the two search paradigms, the ranking features typically used for Query-By-Keyword systems might not be optimal for Query-By-Example. This paper describes our approach to solving these challenges. We present experimental results confirming the effectiveness of the proposed solution, particularly on query building and search ranking tasks. As of writing this paper, the new system has been available to all LinkedIn members.

arXiv.org e-Print Archive

Crossref

Grammar compressed sequences with rank/select support

Author: Brisaboa Nieves R.
Navarro Gonzalo
Ordóñez Alberto
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

An early partial version of this paper appeared in Proc. SPIRE 2014: G. Navarro, A. Ordóñez, Grammar compressed sequences with rank/select support, Proc. 21st International Symposium on String Processing and Information Retrieval, LNCS, SPIRE, vol. 8799 (2014), pp. 31–44.

The final publication is available at Springer via http://dx.doi.org/10.1016/j.jda.2016.10.001

[Abstract] Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly repetitive sequences, and classical statistical compression proves ineffective. We introduce, instead, grammar-based representations for repetitive sequences, which use up to 6% of the space needed by statistically compressed representations, and support direct access and rank/select operations within tens of microseconds. We demonstrate the impact of our structures in text indexing applications.

Chile. Fondo Nacional de Desarrollo Científico y Tecnológico; 140796
Ministerio de Economía, Industria y Competitividad; 00645663/ITC-20133062
Ministerio de Economía, Industria y Competitividad; TIN2009-14560-C03-02
Ministerio de Economía, Industria y Competitividad; TIN2010-21246-C02-01
Ministerio de Economía, Industria y Competitividad; TIN2013-46238-C4-3-R
Ministerio de Economía, Industria y Competitividad; TIN2013-47090-C3-3-P
Ministerio de Economía, Industria y Competitividad; AP2010-6038
Xunta de Galicia; GRC2013/05
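For readers unfamiliar with the access, rank and select operations named in the abstract of this entry, the following sketch shows what they return on a plain, uncompressed sequence. It is a linear-time reference definition only, under the common convention that rank counts occurrences in a prefix and select uses 1-based occurrence numbers; it says nothing about how the grammar-compressed structures of the paper answer these queries.

```python
def access(seq, i):
    """Symbol at position i (0-based)."""
    return seq[i]

def rank(seq, c, i):
    """Number of occurrences of symbol c in the prefix seq[0:i]."""
    return seq[:i].count(c)

def select(seq, c, j):
    """Position (0-based) of the j-th occurrence of c, with j starting at 1.

    Returns -1 if c occurs fewer than j times.
    """
    seen = 0
    for pos, symbol in enumerate(seq):
        if symbol == c:
            seen += 1
            if seen == j:
                return pos
    return -1

seq = "abracadabra"
print(access(seq, 4))       # c
print(rank(seq, "a", 5))    # 2
print(select(seq, "a", 3))  # 5
```

The two operations are inverses in the sense that `rank(seq, c, select(seq, c, j) + 1) == j` whenever the j-th occurrence exists.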

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositorio Académico de la Universidad de Chile

Stochastic Query Covering for Fast Approximate Document Retrieval

Author: Anagnostopoulos Aristidis
Becchetti Luca
Mele Ida
Bordino Ilaria
Leonardi Stefano
Sankowski Piotr
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that we can efficiently provide high-quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving a performance close to the optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios.
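The selection problem described in the abstract above can be pictured with a standard greedy maximum-coverage heuristic: repeatedly pick the document that answers the largest remaining share of query probability. This is a swapped-in textbook technique used for illustration, not the algorithm analysed in the paper; the data layout (`doc_answers`, `query_prob`) and all names are assumptions.

```python
def greedy_cover(doc_answers, query_prob, k):
    """Pick up to k documents, greedily maximizing covered query probability.

    doc_answers maps a document ID to the set of queries it answers well;
    query_prob maps each query to its probability under the query distribution.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for doc, queries in doc_answers.items():
            if doc in chosen:
                continue
            # Marginal gain: probability mass of queries not yet covered.
            gain = sum(query_prob[q] for q in queries - covered)
            if gain > best_gain:
                best, best_gain = doc, gain
        if best is None:  # no remaining document adds coverage
            break
        chosen.append(best)
        covered |= doc_answers[best]
    return chosen, sum(query_prob[q] for q in covered)

# Hypothetical toy instance: three documents, three queries.
doc_answers = {"d1": {"q1", "q2"}, "d2": {"q2", "q3"}, "d3": {"q3"}}
query_prob = {"q1": 0.5, "q2": 0.3, "q3": 0.2}
chosen, mass = greedy_cover(doc_answers, query_prob, k=2)
print(chosen, round(mass, 3))
```

With a budget of two documents the toy instance already covers the whole query distribution, which mirrors the abstract's claim that a small subset can answer most queries.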

Archivio della ricerca - Università di Roma La Sapienza

MPG.PuRe