22,363 research outputs found

Linear-time String Indexing and Analysis in Small Space

Author: Belazzougui Djamal
Cunial Fabio
Karkkainen Juha
Makinen Veli
Publication date: 01/03/2020

The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space also lies at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T \in \{1, \ldots, \sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n \log \sigma)$ bits of space, where $\sigma$ is the alphabet size. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n \log \sigma)$ bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used $O(n \log \sigma)$ bits of space, took $O(n \log\log \sigma)$ time for the first two structures and $O(n \log^{\epsilon} n)$ time for the third, where $\epsilon$ is any positive constant smaller than one. Alternatively, the BWT could previously be built in linear time if one was willing to spend $O(n \log \sigma \log\log_{\sigma} n)$ bits of space. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output. Peer reviewed.
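To make the objects in this abstract concrete, the following is a minimal Python sketch of the Burrows-Wheeler transform and of pattern counting by backward search over it. It is not the linear-time, small-space construction the abstract describes: it sorts suffixes naively and answers rank queries by scanning, and the function names `bwt` and `count_occurrences` as well as the `$` sentinel are assumptions made for illustration.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Return (BWT string, suffix array) of text + sentinel.

    Naive construction by sorting all suffixes; assumes the sentinel
    does not occur in text and sorts before every other character.
    """
    assert sentinel not in text
    s = text + sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character of a suffix is the character preceding it
    # (s[-1], the sentinel, for the suffix starting at position 0).
    return "".join(s[i - 1] for i in sa), sa


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern using backward search on the BWT."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        # rank(c, i) is computed by scanning here; a real index answers
        # it in constant or near-constant time with extra structures.
        lo = first[c] + bwt_str.count(c, 0, lo)
        hi = first[c] + bwt_str.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed, _ = bwt("banana")
    print(transformed)
    print(count_occurrences(transformed, "ana"))
```

The backward search touches the pattern once from right to left, which is why indexes of this family can count occurrences in time that depends on the pattern length rather than on the text length once rank queries are fast.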

Helsingin yliopiston digitaalinen arkisto

String Indexing for Patterns with Wildcards

Author: A. Tam
B. Chazelle
D. Harel
D. Tsur
G. Chen
G. Landau
G. Landau
G. Navarro
H.L. Chan
K. Hofmann
L.P. Coelho
M. Lewenstein
M. Maas
M.L. Fredman
P. Bille
P. Bille
P. Clifford
T.-W. Lam
Z. Galil
Publication date: 01/01/2012

We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.

- A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.
- An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].

We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
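The problem statement above can be illustrated with a small Python sketch: index the text in an uncompacted suffix trie and, at each wildcard, branch into every child. This is only a naive baseline under stated assumptions, not any of the indexes from the abstract. The trie uses quadratic space, and the names `build_suffix_trie` and `search`, as well as the choice of `*` as the wildcard symbol, are hypothetical. The branching at each wildcard is where a factor of up to $\sigma$ per wildcard comes from.

```python
def build_suffix_trie(t):
    """Build an uncompacted suffix trie of t.

    Each node is a dict with its children and the start positions of
    all suffixes passing through it. Quadratic space; for illustration.
    """
    root = {"children": {}, "starts": list(range(len(t)))}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []}
            )
            node["starts"].append(i)
    return root


def search(root, p, wildcard="*"):
    """Report start positions where p matches; wildcard matches any one character."""
    frontier = [root]
    for ch in p:
        next_frontier = []
        for node in frontier:
            if ch == wildcard:
                # Branch into every child: up to sigma new nodes per wildcard.
                next_frontier.extend(node["children"].values())
            elif ch in node["children"]:
                next_frontier.append(node["children"][ch])
        frontier = next_frontier
    return sorted(i for node in frontier for i in node["starts"])


if __name__ == "__main__":
    trie = build_suffix_trie("abracadabra")
    print(search(trie, "a*a"))
    print(search(trie, "*bra"))
```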

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Data Structure Lower Bounds for Document Indexing Problems

Author: Afshani Peyman
Nielsen Jesper Sindahl
Publication date: 01/01/2016

We study data structure problems related to document indexing and pattern matching queries, and our main contribution is to show that the pointer machine model of computation can be extremely useful in proving high and unconditional lower bounds that cannot be obtained in any other known model of computation with the current techniques. Often our lower bounds match the known space-query time trade-off curve and in fact, for all the problems considered, there is a very good and reasonable match between our lower bounds and the known upper bounds, at least for some choice of input parameters. The problems that we consider are set intersection queries (both the reporting variant and the semi-group counting variant), indexing a set of documents for two-pattern queries, or forbidden-pattern queries, or queries with wild-cards, and indexing an input set of gapped-patterns (or two-patterns) to find those matching a document given at the query time. Comment: Full version of the conference version that appeared at ICALP 2016, 25 pages

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.

Ghent University Academic Bibliography

PubMed Central

Comparison Of Modified Dual Ternary Indexing And Multi-Key Hashing Algorithms For Music Information Retrieval

Author: Amudha A.
Karthiga S.
Sridhar Rajeswari
T V Geetha
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 29/07/2010

In this work we have compared two indexing algorithms that have been used to index and retrieve Carnatic music songs. We have compared a modified version of the dual ternary indexing algorithm for music indexing and retrieval with the multi-key hashing indexing algorithm proposed by us. The modification in the dual ternary algorithm was essential to handle variable length query phrases and to accommodate features specific to Carnatic music. The dual ternary indexing algorithm is adapted for Carnatic music by segmenting using the segmentation technique for Carnatic music. The dual ternary algorithm is compared with the multi-key hashing algorithm designed by us for indexing and retrieval, in which features like MFCC, spectral flux, melody string and spectral centroid are used as features for indexing data into a hash table. The way in which collision resolution was handled by this hash table is different from the normal hash table approaches. It was observed that multi-key hashing based retrieval had a lower time complexity than dual-ternary based indexing. The algorithms were also compared for their precision and recall, in which multi-key hashing had a better recall than modified dual ternary indexing for the sample data considered. Comment: 11 pages, 5 figures

arXiv.org e-Print Archive

Crossref

Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

Author: Amir Amihood
Franceschini Gianni
Grossi Roberto
Kopelowitz Tsvi
Lewenstein Moshe
Lewenstein Noa
Publication date: 03/06/2013

This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons, into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g. records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work as no particular exploitation of the underlying structure of is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, an online suffix tree can be constructed in worst-case time $O(\log n)$ per input symbol (as opposed to amortized $O(\log n)$ time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves $O(\log n)$ worst-case time per input symbol. Searching for a pattern of length $m$ in the resulting suffix tree takes $O(\min(m\log |\Sigma|, m + \log n) + tocc)$ time, where $tocc$ is the number of occurrences of the pattern. The paper also describes more applications and shows how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors and order maintenance.
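The cost of comparing unbounded-length keys can be seen in a plain comparison-driven search. The Python sketch below is a baseline for contrast, not the paper's technique: it binary-searches a naively built suffix array, so each of the $O(\log n)$ comparisons may inspect up to $m$ characters, giving $O(m \log n)$ character comparisons plus the time to report the output, rather than the $m + \log n$ term quoted above. The names `suffix_array` and `find_occurrences` are assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern in text via binary search.

    Every comparison looks at up to len(pattern) characters, which is
    the non-constant comparison cost of string keys.
    """
    m = len(pattern)

    def prefix(k):
        # Length-m prefix of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "mississippi"
    sa = suffix_array(text)
    print(find_occurrences(text, sa, "issi"))
```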

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Archivio della ricerca- Università di Roma La Sapienza

Suffix Tree of Alignment: An Efficient Index for Similar Data

Author: A. Amir
D. Gusfield
E. Ukkonen
E.M. McCreight
G. Navarro
H.H. Do
J. Ziv
K. Sadakane
M. Crochemore
M. Farach-Colton
P. Bille
R. Grossi
R.A. Baeza-Yates
S. Huang
S. Karlin
S. Kuruppu
V. Levenshtein
V. Mäkinen
V. Mäkinen
Publication date: 01/01/2013

We consider an index data structure for similar strings. The generalized suffix tree can be a solution for this. The generalized suffix tree of two strings $A$ and $B$ is a compacted trie representing all suffixes in $A$ and $B$. It has $|A|+|B|$ leaves and can be constructed in $O(|A|+|B|)$ time. However, if the two strings are similar, the generalized suffix tree is not efficient because it does not exploit the similarity which is usually represented as an alignment of $A$ and $B$. In this paper we propose a space/time-efficient suffix tree of alignment which wisely exploits the similarity in an alignment. Our suffix tree for an alignment of $A$ and $B$ has $|A| + l_d + l_1$ leaves where $l_d$ is the sum of the lengths of all parts of $B$ different from $A$ and $l_1$ is the sum of the lengths of some common parts of $A$ and $B$. We did not compromise the pattern search to reduce the space. Our suffix tree can be searched for a pattern $P$ in $O(|P|+occ)$ time where $occ$ is the number of occurrences of $P$ in $A$ and $B$. We also present an efficient algorithm to construct the suffix tree of alignment. When the suffix tree is constructed from scratch, the algorithm requires