899 research outputs found
Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared
Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Indexing highly repetitive texts - such as genomic databases, software
repositories and versioned text collections - has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms
(BWTs). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. In this paper we close this long-standing problem, showing how to extend the
Run-Length FM-index so that it can locate the occ occurrences efficiently
within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m
+ occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n
over an alphabet of size {\sigma} on a RAM machine with words of w =
{\Omega}(log n) bits. Within that space, our index can also count in optimal
time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and
locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which
is optimal in the packed setting and had not been obtained before in compressed
space. We also describe a structure using O(r log(n/r)) space that replaces the
text and extracts any text substring of length ` in almost-optimal time
O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct
access to suffix array, inverse suffix array, and longest common prefix array
cells, and extend these capabilities to full suffix tree functionality,
typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log
log_w(n/r + sigma)
Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
Deterministic sub-linear space LCE data structures with efficient construction
Given a string of symbols, a longest common extension query
asks for the length of the longest common prefix of the
th and th suffixes of . LCE queries have several important
applications in string processing, perhaps most notably to suffix sorting.
Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015:
65-76) described several data structures for answering LCE queries that offers
a space-time trade-off between data structure size and query time. In
particular, for a parameter , their best deterministic
solution is a data structure of size which allows LCE queries to be
answered in time. However, the construction time for all
deterministic versions of their data structure is quadratic in . In this
paper, we propose a deterministic solution that achieves a similar space-time
trade-off of query time using
space, but significantly improve the construction time to
.Comment: updated titl
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is , the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used space and was able to efficiently count the number of
occurrences of a pattern of length in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the occurrences efficiently within space (in
loglogarithmic time each), and reaching optimal time within
space, on a RAM machine of bits. Within
space, our index can also count in optimal time .
Raising the space to , we support count and locate in
and time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using space that replaces the text and
extracts any text substring of length in almost-optimal time
. (...continues...
Heaviest Induced Ancestors and Longest Common Substrings
Suppose we have two trees on the same set of leaves, in which nodes are
weighted such that children are heavier than their parents. We say a node from
the first tree and a node from the second tree are induced together if they
have a common leaf descendant. In this paper we describe data structures that
efficiently support the following heaviest-induced-ancestor query: given a node
from the first tree and a node from the second tree, find an induced pair of
their ancestors with maximum combined weight. Our solutions are based on a
geometric interpretation that enables us to find heaviest induced ancestors
using range queries. We then show how to use these results to build an
LZ-compressed index with which we can quickly find with high probability a
longest substring common to the indexed string and a given pattern
Combined Data Structure for Previous- and Next-Smaller-Values
Let be a static array storing elements from a totally ordered set. We
present a data structure of optimal size at most
bits that allows us to answer the following queries on in constant time,
without accessing : (1) previous smaller value queries, where given an index
, we wish to find the first index to the left of where is strictly
smaller than at , and (2) next smaller value queries, which search to the
right of . As an additional bonus, our data structure also allows to answer
a third kind of query: given indices , find the position of the minimum in
. Our data structure has direct consequences for the space-efficient
storage of suffix trees.Comment: to appear in Theoretical Computer Scienc
- …