Compressed Subsequence Matching and Packed Tree Coloring

Author: A. Tiskin
D.D. Sleator
G. Das
H. Mannila
J. Ziv
M. Charikar
M. Crochemore
M. Thorup
M.A. Bender
M.L. Fredman
N.J. Larsson
O. Berkman
P. Cégielski
P. Ferragina
P.F. Dietz
R.A. Baeza-Yates
S. Abiteboul
S. Alstrup
T. Yamamoto
W. Rytter
Z. Troníček
Publication date: 01/01/2014

We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size $n$ compressing a string of size $N$ and a pattern string of size $m$ over an alphabet of size $\sigma$, our algorithm uses $O(n+\frac{n\sigma}{w})$ space and $O(n+\frac{n\sigma}{w}+m\log N\log w\cdot occ)$ or $O(n+\frac{n\sigma}{w}\log w+m\log N\cdot occ)$ time. Here $w$ is the word size and $occ$ is the number of occurrences of the pattern. Our algorithm uses less space than previous algorithms and is also faster for $occ=o(\frac{n}{\log N})$ occurrences. The algorithm uses a new data structure that allows us to efficiently find the next occurrence of a given character after a given position in a compressed string. This data structure in turn is based on a new data structure for the tree color problem, where the node colors are packed in bit strings. Comment: To appear at CPM '1
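The "next occurrence of a given character after a given position" query can be illustrated on a plain, uncompressed string. The sketch below is not the paper's data structure (which answers the query on the grammar itself, with colors packed into machine words); it is a minimal table-based version, and the names `build_next_table` and `match_from` are hypothetical.

```python
def build_next_table(text):
    """nxt[i][c] = smallest index j >= i with text[j] == c (key absent if none).

    Assumption: the text is stored uncompressed, so the table takes
    O(N * sigma) space, unlike the compressed structure in the abstract.
    """
    nxt = [dict() for _ in range(len(text) + 1)]
    for i in range(len(text) - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])  # inherit answers from position i + 1
        nxt[i][text[i]] = i        # text[i] itself is the nearest occurrence
    return nxt


def match_from(nxt, pattern, start=0):
    """Greedily embed `pattern` as a subsequence at or after `start`.

    Returns the matched positions, or None if no embedding exists.
    """
    pos = start
    positions = []
    for ch in pattern:
        j = nxt[pos].get(ch)
        if j is None:
            return None
        positions.append(j)
        pos = j + 1
    return positions


nxt = build_next_table("abracadabra")
print(match_from(nxt, "aca"))           # [0, 4, 5]
print(match_from(nxt, "aca", start=1))  # [3, 4, 5]
```

Each pattern character costs one table lookup here; the bounds in the abstract reflect the extra work of answering the same query without decompressing the string.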

arXiv.org e-Print Archive

CiteSeerX

Crossref

Online Research Database In Technology

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.

Ghent University Academic Bibliography

PubMed Central

Efficient Pattern Matching in Python

Author: Bachmair L.
Behnel S.
Clausen M.
Kirchner H.
Klop J. W.
Kounalis E.
Nedjah N.
Rivoal F.
Robie J.
Publication date: 29/09/2017

Pattern matching is a powerful tool for symbolic computations. Applications include term rewriting systems, as well as the manipulation of symbolic expressions, abstract syntax trees, and XML and JSON data. It also allows for an intuitive description of algorithms in the form of rewrite rules. We present the open source Python module MatchPy, which offers functionality and expressiveness similar to the pattern matching in Mathematica. In particular, it includes syntactic pattern matching, as well as matching for commutative and/or associative functions, sequence variables, and matching with constraints. MatchPy uses new and improved algorithms to efficiently find matches for large pattern sets by exploiting similarities between patterns. The performance of MatchPy is investigated on several real-world problems.
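To make "syntactic pattern matching" concrete, here is a minimal one-to-one matcher over nested tuples. It does not use or reproduce the MatchPy API; the term encoding (tuples with the function name first, variables written as strings ending in `_`) and the function name `match` are assumptions chosen for this sketch. Commutative or associative matching, sequence variables, and many-to-one matching are not covered.

```python
def match(pattern, subject, bindings=None):
    """Return a dict of variable bindings if `pattern` matches `subject`, else None.

    Hypothetical encoding: a term is a tuple ("f", arg1, arg2, ...), a variable
    is a string ending in "_" (for example "x_"), anything else is a constant.
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.endswith("_"):
        name = pattern[:-1]
        if name in bindings:
            # A repeated variable must bind to the same subterm every time.
            return bindings if bindings[name] == subject else None
        bindings[name] = subject
        return bindings
    if isinstance(pattern, tuple) and isinstance(subject, tuple):
        if len(pattern) != len(subject):
            return None
        # The head (function name) is compared like any other constant.
        for p, s in zip(pattern, subject):
            bindings = match(p, s, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == subject else None


print(match(("plus", "x_", "x_"), ("plus", 2, 2)))  # {'x': 2}
print(match(("plus", "x_", "x_"), ("plus", 2, 3)))  # None
```

A successful match without variables returns an empty dict, so callers should test the result with `is not None` rather than truthiness.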

arXiv.org e-Print Archive

Crossref

Fast multi-image matching via density-based clustering

Author: Daniilidis K.
Esteves C.
Tron Roberto
Zhou X.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

We consider the problem of finding consistent matches across multiple images. Previous state-of-the-art solutions use constraints on cycles of matches together with convex optimization, leading to computationally intensive iterative algorithms. In this paper, we propose a clustering-based formulation. We first rigorously show its equivalence with the previous one, and then propose QuickMatch, a novel algorithm that identifies multi-image matches from a density function in feature space. We use the density to order the points in a tree, and then extract the matches by breaking this tree using feature distances and measures of distinctiveness. Our algorithm outperforms previous state-of-the-art methods (such as MatchALS) in accuracy, is significantly faster (up to 62 times faster on some benchmarks), and can scale to large datasets (with more than twenty thousand features).
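The tree-then-break idea can be sketched roughly as follows. This is a simplified illustration, not the QuickMatch algorithm itself: the Gaussian kernel, the fixed `max_dist` threshold (standing in for the paper's distinctiveness measures), and the function name `density_tree_matches` are all assumptions.

```python
import math


def density_tree_matches(points, bandwidth=1.0, max_dist=1.0):
    """Group features into multi-image matches.

    points: list of (image_id, feature_vector). Returns a list of index sets.
    Assumption: a match may contain at most one feature per image.
    """
    n = len(points)
    dist = [[math.dist(points[i][1], points[j][1]) for j in range(n)]
            for i in range(n)]
    # Kernel density estimate at every point (Gaussian kernel, assumed here).
    density = [sum(math.exp(-((d / bandwidth) ** 2) / 2) for d in row)
               for row in dist]

    # Tree: link each point to its nearest neighbor of higher density.
    # Ties are broken by index so that the links cannot form a cycle.
    edges = []
    for i in range(n):
        higher = [j for j in range(n) if (density[j], j) > (density[i], i)]
        if higher:
            j = min(higher, key=lambda k: dist[i][k])
            edges.append((dist[i][j], i, j))

    parent = list(range(n))
    images = [{points[i][0]} for i in range(n)]

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Keep short edges only; an edge is broken if it is too long or would
    # put two features of the same image into one match.
    for d, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj and d <= max_dist and not (images[ri] & images[rj]):
            parent[ri] = rj
            images[rj] |= images[ri]

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())
```

The pairwise distance matrix makes this sketch quadratic in the number of features, so it only illustrates the structure of the method, not its reported speed.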

Crossref

Boston University Institutional Repository (OpenBU)

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Author: Cohn Trevor
Haffari Gholamreza
Petri Matthias
Shareghi Ehsan
Publication date: 01/01/2016

Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201
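The kind of on-the-fly query involved can be illustrated with a plain suffix array, which supports n-gram counts of any order from a single index. This is a sketch only: it uses an uncompressed suffix array rather than a compressed suffix tree, and an unsmoothed relative-frequency estimate rather than the smoothing a practical language model would apply. The function names are hypothetical.

```python
def build_suffix_array(tokens):
    """Indices of all suffixes of `tokens`, sorted lexicographically.

    Naive construction for illustration; not suitable for large corpora.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count(tokens, sa, ngram):
    """Number of occurrences of `ngram` (a list of tokens) via two binary searches."""
    k = len(ngram)

    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + k]
            if prefix < ngram or (strict and prefix == ngram):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


def relative_frequency(tokens, sa, context, word):
    """Unsmoothed estimate count(context + word) / count(context).

    Simplification: the denominator also counts a context that ends the text.
    """
    denom = count(tokens, sa, context)
    return count(tokens, sa, context + [word]) / denom if denom else 0.0


tokens = "a b a b a c".split()
sa = build_suffix_array(tokens)
print(count(tokens, sa, ["a", "b"]))  # 2
```

Because the index covers every suffix, no maximum n-gram order has to be fixed in advance, which is the property the title calls "infinite-order".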

arXiv.org e-Print Archive

University of Melbourne Institutional Repository

Monash University Research Portal

Document Retrieval on Repetitive Collections

Author: C. Hernández
F. Claude
G. Navarro
H. Ferrada
K. Sadakane
T. Gagie
U. Manber
V. Mäkinen
Publication date: 01/01/2014

Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections. Comment: Accepted to ESA 2014. Implementation and experiments at http://www.cs.helsinki.fi/group/suds/rlcsa
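A brute-force baseline of the kind the abstract refers to might look like the sketch below. It assumes that "most important" means highest term frequency and scans plain strings rather than a compressed index; the function name `top_documents` and the tie-breaking rule are choices made for this illustration, not details from the paper.

```python
def top_documents(docs, pattern, k=2):
    """Return up to k (doc_id, frequency) pairs, most frequent first.

    Brute force: find every (possibly overlapping) occurrence in every
    document, then rank documents by occurrence count.
    """
    scored = []
    for doc_id, text in enumerate(docs):
        freq = 0
        pos = text.find(pattern)
        while pos != -1:
            freq += 1
            pos = text.find(pattern, pos + 1)
        if freq:
            scored.append((freq, doc_id))
    # Higher frequency first; ties broken by lower document id.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(doc_id, freq) for freq, doc_id in scored[:k]]


docs = ["abracadabra", "banana", "cabbage"]
print(top_documents(docs, "a"))  # [(0, 5), (1, 3)]
```

The cost is proportional to the total number of occurrences, which is what tailored indexes try to avoid when a pattern is frequent but appears in few documents.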

arXiv.org e-Print Archive

Crossref