
    Compressed Subsequence Matching and Packed Tree Coloring

    We present a new algorithm for subsequence matching in grammar compressed strings. Given a grammar of size $n$ compressing a string of size $N$ and a pattern string of size $m$ over an alphabet of size $\sigma$, our algorithm uses $O(n+\frac{n\sigma}{w})$ space and $O(n+\frac{n\sigma}{w}+m\log N\log w\cdot occ)$ or $O(n+\frac{n\sigma}{w}\log w+m\log N\cdot occ)$ time. Here $w$ is the word size and $occ$ is the number of occurrences of the pattern. Our algorithm uses less space than previous algorithms and is also faster for $occ=o(\frac{n}{\log N})$ occurrences. The algorithm uses a new data structure that allows us to efficiently find the next occurrence of a given character after a given position in a compressed string. This data structure in turn is based on a new data structure for the tree color problem, where the node colors are packed in bit strings. Comment: To appear at CPM '1
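    As a rough illustration of how next-occurrence queries drive subsequence matching, the following is a minimal sketch on an uncompressed string. It is not the paper's grammar-based data structure: the sorted position lists, the helper names (`build_positions`, `next_occurrence`, `prev_occurrence`, `minimal_occurrences`), and the reading of an "occurrence" as a minimal window containing the pattern as a subsequence are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_positions(text):
    """Map each character to the sorted list of positions where it occurs."""
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return positions


def next_occurrence(positions, ch, i):
    """Smallest position >= i holding ch, or None if there is none."""
    plist = positions.get(ch, [])
    k = bisect_left(plist, i)
    return plist[k] if k < len(plist) else None


def prev_occurrence(positions, ch, i):
    """Largest position <= i holding ch, or None if there is none."""
    plist = positions.get(ch, [])
    k = bisect_right(plist, i)
    return plist[k - 1] if k > 0 else None


def minimal_occurrences(text, pattern):
    """Return (begin, end) pairs of minimal windows of text that contain
    pattern as a subsequence (hypothetical helper, uncompressed input)."""
    if not pattern:
        return []
    positions = build_positions(text)
    result = []
    start = 0
    while True:
        # Forward pass: greedily embed the pattern as early as possible.
        pos = start
        end = None
        for ch in pattern:
            p = next_occurrence(positions, ch, pos)
            if p is None:
                return result
            end = p
            pos = p + 1
        # Backward pass: from that end, embed as late as possible
        # to shrink the window to a minimal one.
        pos = end
        begin = None
        for ch in reversed(pattern):
            begin = prev_occurrence(positions, ch, pos)
            pos = begin - 1
        result.append((begin, end))
        start = begin + 1
```

    Each pattern character costs one next-occurrence (or previous-occurrence) query here, which is why the speed of that query dominates the per-occurrence cost.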

    Prospects and limitations of full-text index structures in genome analysis

    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.

    Efficient Pattern Matching in Python

    Pattern matching is a powerful tool for symbolic computations. Applications include term rewriting systems, as well as the manipulation of symbolic expressions, abstract syntax trees, and XML and JSON data. It also allows for an intuitive description of algorithms in the form of rewrite rules. We present the open source Python module MatchPy, which offers functionality and expressiveness similar to the pattern matching in Mathematica. In particular, it includes syntactic pattern matching, as well as matching for commutative and/or associative functions, sequence variables, and matching with constraints. MatchPy uses new and improved algorithms to efficiently find matches for large pattern sets by exploiting similarities between patterns. The performance of MatchPy is investigated on several real-world problems.
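    To make the notion of syntactic pattern matching concrete, here is a minimal plain-Python sketch. It does not use MatchPy and does not reflect its API: the `Var` class, the `match` function, and the encoding of expressions as nested tuples are hypothetical choices for this example, and commutative or associative matching, sequence variables, and constraints are not covered.

```python
class Var:
    """A named pattern variable that matches any single subterm."""

    def __init__(self, name):
        self.name = name


def match(pattern, subject, bindings=None):
    """Syntactically match pattern against subject.

    Expressions are nested tuples such as ("add", 1, 2). Returns a dict of
    variable bindings on success (possibly empty) or None on failure.
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, Var):
        if pattern.name in bindings:
            # A repeated variable must match the same subterm each time.
            return bindings if bindings[pattern.name] == subject else None
        bindings[pattern.name] = subject
        return bindings
    if isinstance(pattern, tuple):
        if not isinstance(subject, tuple) or len(pattern) != len(subject):
            return None
        for p, s in zip(pattern, subject):
            bindings = match(p, s, bindings)
            if bindings is None:
                return None
        return bindings
    # Atoms (symbols, numbers) must be equal.
    return bindings if pattern == subject else None
```

    A successful match with no variables returns an empty dict, so callers should test the result with `is None` rather than truthiness.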

    Fast multi-image matching via density-based clustering

    We consider the problem of finding consistent matches across multiple images. Previous state-of-the-art solutions use constraints on cycles of matches together with convex optimization, leading to computationally intensive iterative algorithms. In this paper, we propose a clustering-based formulation. We first rigorously show its equivalence with the previous one, and then propose QuickMatch, a novel algorithm that identifies multi-image matches from a density function in feature space. We use the density to order the points in a tree, and then extract the matches by breaking this tree using feature distances and measures of distinctiveness. Our algorithm outperforms previous state-of-the-art methods (such as MatchALS) in accuracy, is significantly faster (up to 62 times faster on some benchmarks), and can scale to large datasets (with more than twenty thousand features).

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying). Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201
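    The idea of answering n-gram count queries on the fly from a suffix-based index can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a plain, uncompressed suffix array built by naive sorting rather than a compressed suffix tree, and it computes unsmoothed relative frequencies rather than the smoothed probabilities a real language model would use. The function names are hypothetical.

```python
def build_suffix_array(tokens):
    """Naive suffix array: start positions sorted by the suffix they begin.
    Quadratic or worse; fine for a toy corpus only."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens, sa, ngram):
    """Count occurrences of ngram (a list of tokens) by binary search
    over the suffix array for the range of suffixes it prefixes."""
    n = len(ngram)
    # Lower bound: first suffix whose n-token prefix is >= ngram.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < ngram:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose n-token prefix is > ngram.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= ngram:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


def relative_frequency(tokens, sa, context, word):
    """Unsmoothed estimate count(context + word) / count(context)."""
    denom = count_ngram(tokens, sa, context)
    if denom == 0:
        return 0.0
    return count_ngram(tokens, sa, context + [word]) / denom
```

    Because counts for any n-gram length come from the same index, no maximum order has to be fixed when the index is built.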

    Document Retrieval on Repetitive Collections

    Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this paper we address this shortcoming by exploring the relationship between the nature of the underlying collection and the performance of current methods. Via extensive experiments we show that established solutions are often beaten in practice by brute-force alternatives. We also design new methods that offer superior time/space trade-offs, particularly on repetitive collections. Comment: Accepted to ESA 2014. Implementation and experiments at http://www.cs.helsinki.fi/group/suds/rlcsa
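    A brute-force baseline of the kind the abstract refers to might look like the sketch below. It is an assumption-laden illustration, not the paper's implementation: it scans each document directly instead of using an index, it takes "most important" to mean highest term frequency, and the function name `top_k_documents` is hypothetical.

```python
from collections import Counter


def top_k_documents(docs, pattern, k):
    """Return up to k (doc_id, frequency) pairs for the documents in which
    pattern occurs most often, scanning every document (brute force).
    Ties are broken by the smaller document id."""
    freq = Counter()
    for doc_id, doc in enumerate(docs):
        occurrences = 0
        pos = doc.find(pattern)
        while pos != -1:
            occurrences += 1
            # Step by one position so overlapping occurrences are counted.
            pos = doc.find(pattern, pos + 1)
        if occurrences:
            freq[doc_id] = occurrences
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

    Setting `k` to the number of documents turns the same routine into plain document listing, that is, reporting every document that contains the pattern.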