1,728 research outputs found

    Tree Compression with Top Trees Revisited

    Get PDF
    We revisit tree compression with top trees (Bille et al, ICALP'13) and present several improvements to the compressor and its analysis. By significantly reducing the amount of information stored and guiding the compression step using a RePair-inspired heuristic, we obtain a fast compressor achieving good compression ratios, addressing an open problem posed by Bille et al. We show how, with relatively small overhead, the compressed file can be converted into an in-memory representation that supports basic navigation operations in worst-case logarithmic time without decompression. We also show a much improved worst-case bound on the size of the output of top-tree compression (answering an open question posed in a talk on this algorithm by Weimann in 2012).Comment: SEA 201

    A data science approach to pattern discovery in complex structures with applications in bioinformatics

    Get PDF
    Pattern discovery aims to find interesting, non-trivial, implicit, previously unknown and potentially useful patterns in data. This dissertation presents a data science approach for discovering patterns or motifs from complex structures, particularly complex RNA structures. RNA secondary and tertiary structure motifs are very important in biological molecules, which play multiple vital roles in cells. A lot of work has been done on RNA motif annotation. However, pattern discovery in RNA structure is less studied. In the first part of this dissertation, an ab initio algorithm, named DiscoverR, is introduced for pattern discovery in RNA secondary structures. This algorithm works by representing RNA secondary structures as ordered labeled trees and performs tree pattern discovery using a quadratic time dynamic programming algorithm. The algorithm is able to identify and extract the largest common substructures from two RNA molecules of different sizes, without prior knowledge of locations and topologies of these substructures. One application of DiscoverR is to locate the RNA structural elements in genomes. Experimental results show that this tool complements the currently used approaches for mining conserved structural RNAs in the human genome. DiscoverR can also be extended to find repeated regions in an RNA secondary structure. Specifically, this extended method is used to detect structural repeats in the 3\u27-untranslated region of a protein kinase gene

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    Tree-Independent Dual-Tree Algorithms

    Full text link
    Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, the traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a tree-independent manner and easy extension to entirely new types of trees. Representations are provided for five common algorithms; for k-nearest neighbor search, this leads to a novel, tighter pruning bound. The meta-algorithm also allows straightforward extensions to massively parallel settings.Comment: accepted in ICML 201

    Minimal Suffix and Rotation of a Substring in Optimal Time

    Get PDF
    For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by Θ(nlogn)\Theta(n\log n) product of these time complexities. Next, we extend our queries to support concatenations of O(1)O(1) substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing. Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings

    Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries

    Get PDF
    Longest common extension queries (LCE queries) and runs are ubiquitous in algorithmic stringology. Linear-time algorithms computing runs and preprocessing for constant-time LCE queries have been known for over a decade. However, these algorithms assume a linearly-sortable integer alphabet. A recent breakthrough paper by Bannai et.\ al.\ (SODA 2015) showed a link between the two notions: all the runs in a string can be computed via a linear number of LCE queries. The first to consider these problems over a general ordered alphabet was Kosolobov (\emph{Inf.\ Process.\ Lett.}, 2016), who presented an O(n(logn)2/3)O(n (\log n)^{2/3})-time algorithm for answering O(n)O(n) LCE queries. This result was improved by Gawrychowski et.\ al.\ (accepted to CPM 2016) to O(nloglogn)O(n \log \log n) time. In this work we note a special \emph{non-crossing} property of LCE queries asked in the runs computation. We show that any nn such non-crossing queries can be answered on-line in O(nα(n))O(n \alpha(n)) time, which yields an O(nα(n))O(n \alpha(n))-time algorithm for computing runs

    Efficient Detection of Repeating Sites to Accelerate Phylogenetic Likelihood Calculations

    Get PDF
    The phylogenetic likelihood function (PLF) is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection, and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory savings attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 12-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the PLF currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation
    corecore