Search CORE

5 research outputs found

Document retrieval on repetitive string collections

Author: Gagie Travis
Hartikainen Aleksi
Karhu Kalle
Kärkkäinen Juha
Navarro Gonzalo
Puglisi Simon J.
Sirén Jouni
Publication date: 01/01/2017

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.

arXiv.org e-Print Archive

Crossref

Springer - Publisher Connector

eScholarship - University of California

Helsingin yliopiston digitaalinen arkisto

Repositorio Académico de la Universidad de Chile

Tailoring r-index for Document Listing Towards Metagenomics Applications

Author: A Jez
D Belazzougui
D Carroll
D Cobas
DE Wood
DH Huson
F Claude
G Navarro
J Fischer
K Sadakane
L Schaeffer
M Charikar
ML Fredman
MS Lindner
N Välimäki
NL Bray
S Gog
T Gagie
U Manber
V Mäkinen
W Rytter
Z Iqbal
Publication venue: Springer Nature Switzerland AG
Publication date: 01/01/2020

A basic problem in metagenomics is to assign a sequenced read to the correct species in the reference collection. In typical applications in genomic epidemiology and viral metagenomics the reference collection consists of a set of species with each species represented by its highly similar strains. It has been recently shown that accurate read assignment can be achieved with k-mer hashing-based pseudoalignment: a read is assigned to species A if each of its k-mer hits to a reference collection is located only on strains of A. We study the underlying primitives required in pseudoalignment and related tasks. We propose three space-efficient solutions building upon the document listing with frequencies problem. All the solutions use an r-index (Gagie et al., SODA 2018) as an underlying index structure for the text obtained as concatenation of the set of species, as well as for each species. Given t species whose concatenation length is n, and whose Burrows-Wheeler transform contains r runs, our first solution, based on a grammar-compressed document array with precomputed queries at nonterminal symbols, reports the frequencies for the distinct documents in which the pattern of length m occurs in time. Our second solution is also based on a grammar-compressed document array, but enhanced with bitvectors and reports the frequencies in time, over a machine with wordsize w. Our third solution, based on the interleaved LCP array, answers the same query in time. We implemented our solutions and tested them on real-world and synthetic datasets. The results show that all the solutions are fast on highly-repetitive data, and the size overhead introduced by the indexes is comparable with the size of the r-index.

Crossref

Helsingin yliopiston digitaalinen arkisto

Document retrieval on repetitive string collections.

Author: Gagie Travis
Publication date: 15/09/2023

Ezid

Document retrieval on repetitive string collections.

Author: Gagie Travis
Hartikainen Aleksi
Karhu Kalle
Kärkkäinen Juha
Navarro Gonzalo
Puglisi Simon
Sirén Jouni
Publication venue: eScholarship, University of California
Publication date: 01/01/2017

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.

eScholarship - University of California

Document retrieval on repetitive string collections

Author: Aleksi Hartikainen
C Hernández
F Claude
G Navarro
Gonzalo Navarro
HH Do
J Dhaliwal
J Fischer
Jouni Sirén
Juha Kärkkäinen
K Sadakane
Kalle Karhu
M Rochkind
NJ Larsson
R Baeza-Yates
S Büttcher
S Kreft
Simon J. Puglisi
T Gagie
Travis Gagie
U Manber
V Mäkinen
W Szpankowski
WK Hon
ZD Stephens
Publication venue: Springer Science and Business Media LLC

Crossref