4,212 research outputs found

    Matching Subsequences in Trees

    Given two rooted, labeled trees $P$ and $T$, the tree path subsequence problem is to determine which paths in $P$ are subsequences of which paths in $T$. Here a path begins at the root and ends at a leaf. In this paper we propose this problem as a useful query primitive for XML data, and provide new algorithms improving the previously best known time and space bounds. Comment: Minor correction of typos, et
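    As a rough illustration of the problem statement only (not the paper's algorithm), the brute-force sketch below enumerates root-to-leaf paths and tests each pair. The tree encoding `(label, [children])` and all function names are assumptions made for this example.

```python
def root_to_leaf_paths(tree):
    """Yield the label sequence of every root-to-leaf path.

    A tree is assumed to be a pair (label, [child trees]).
    """
    label, children = tree
    if not children:
        yield [label]
        return
    for child in children:
        for path in root_to_leaf_paths(child):
            yield [label] + path


def is_subsequence(p, t):
    """True if p can be obtained from t by deleting elements."""
    it = iter(t)
    # `x in it` advances the iterator, so matches must occur in order.
    return all(x in it for x in p)


def path_subsequences(P, T):
    """Return the set of (i, j) with path i of P a subsequence of path j of T."""
    t_paths = list(root_to_leaf_paths(T))
    return {
        (i, j)
        for i, p in enumerate(root_to_leaf_paths(P))
        for j, t in enumerate(t_paths)
        if is_subsequence(p, t)
    }


P = ("a", [("b", []), ("c", [])])
T = ("a", [("x", [("b", [])]), ("c", [("d", [])])])
print(path_subsequences(P, T))
```

    This compares every pair of paths explicitly, which is the cost the improved time and space bounds are meant to avoid.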

    Compressed Spaced Suffix Arrays

    Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice

    LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations

    LRM-Trees are an elegant way to partition a sequence of values into sorted consecutive blocks, and to express the relative position of the first element of each block within a previous block. They were used to encode ordinal trees and to index integer arrays in order to support range minimum queries on them. We describe how they yield many other convenient results in a variety of areas, from data structures to algorithms: some compressed succinct indices for range minimum queries; a new adaptive sorting algorithm; and a compressed succinct data structure for permutations supporting direct and indirect application in time that is shorter the more compressible the permutation is. Comment: 13 pages, 1 figur
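    A minimal sketch of how such a tree can be built, assuming the usual left-to-right-minima definition in which the parent of position i is the nearest earlier position holding a strictly smaller value, with a virtual root (encoded here as -1) when no such position exists. The function name and the -1 convention are choices made for this example.

```python
def lrm_tree_parents(values):
    """Return parent[i] for each position i of the LRM-tree.

    parent[i] is the closest j < i with values[j] < values[i],
    or -1 (the virtual root) if no such j exists.
    """
    parents = []
    stack = []  # positions whose values are strictly increasing
    for i, v in enumerate(values):
        # Discard positions that can never be a previous smaller value again.
        while stack and values[stack[-1]] >= v:
            stack.pop()
        parents.append(stack[-1] if stack else -1)
        stack.append(i)
    return parents


print(lrm_tree_parents([3, 1, 4, 1, 5, 9, 2, 6]))
# [-1, -1, 1, -1, 3, 4, 3, 6]
```

    Each position is pushed and popped at most once, so the construction runs in linear time.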

    Computer Aided Simulation of DNA Fingerprint Amplified Fragment Length Polymorphism (AFLP) Using Suffix Tree Indexing and Data Mining

    AFLP is a DNA fingerprinting technique with broad application as a genetic marker in various fields. It begins with digestion of the DNA sequence using one or more restriction enzymes, then ligation of adapters to the overhanging sticky ends, followed by amplification of the DNA fragments using PCR. The PCR reaction uses primers that match the adapter sequence and carry some (1 to 3) additional “selective” bases, which can be any bases; this reduces the number of bands that will be amplified. The technique is intended to make the amplified fragments more distinctive, so that the polymorphism of the organism being studied can be clearly visualized by gel electrophoresis. The computer-aided AFLP simulation developed in this research aims to predict this electrophoresis result by simulating the digestion, ligation and PCR processes, using pattern recognition algorithms applied to DNA sequences from online databases. Through this simulation, researchers can determine the best combination of restriction enzymes and selective bases for their laboratory experiments. Suffix tree indexing was used while exploring the genome sequence (in FASTA format) to find the restriction sites rapidly and cut the sequence into fragments. Data modeling enables the system to draw the fragments as a virtual DNA electrophoresis pattern. Data mining completes the simulation by exploring all possible virtual DNA electrophoresis patterns and determining the best combination of restriction enzyme and selective bases by calculating certain quantitative criteria.
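    The digestion step alone can be sketched as follows. This is an illustration under simplifying assumptions, not the system described in the abstract: it treats the sequence as a single linear strand and uses Python's plain `str.find` in place of a suffix tree index. The function name `digest` and the example sequence are hypothetical; the site used is that of EcoRI, which recognizes GAATTC and cuts after the G.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of a recognition site.

    `cut_offset` is the number of bases of the site that stay on the
    left-hand fragment (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


fragments = digest("AAGAATTCCCGAATTCTT", "GAATTC", 1)
print(fragments)                    # ['AAG', 'AATTCCCG', 'AATTCTT']
print([len(f) for f in fragments])  # [3, 8, 7]
```

    The fragment lengths are what a virtual electrophoresis pattern would be drawn from; ligation and selective amplification are not modeled here.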

    Faster Approximate String Matching for Short Patterns

    We study the classical approximate string matching problem, that is, given strings $P$ and $Q$ and an error threshold $k$, find all ending positions of substrings of $Q$ whose edit distance to $P$ is at most $k$. Let $P$ and $Q$ have lengths $m$ and $n$, respectively. On a standard unit-cost word RAM with word size $w \geq \log n$ we present an algorithm using time $O(nk \cdot \min(\frac{\log^2 m}{\log n},\frac{\log^2 m\log w}{w}) + n)$. When $P$ is short, namely, $m = 2^{o(\sqrt{\log n})}$ or $m = 2^{o(\sqrt{w/\log w})}$, this improves the previously best known time bounds for the problem. The result is achieved using a novel implementation of the Landau-Vishkin algorithm based on tabulation and word-level parallelism. Comment: To appear in Theory of Computing System
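    For reference, the problem itself can be solved with the textbook O(mn) dynamic program, sketched below. This is the classical baseline, not the tabulated Landau-Vishkin implementation the abstract describes; the function name and the choice of 1-based ending positions are assumptions made for this example.

```python
def approx_match_ends(P, Q, k):
    """Return 1-based ending positions j in Q such that some substring
    of Q ending at j has edit distance at most k to P."""
    m = len(P)
    # col[i] = edit distance between P[:i] and the best substring of Q
    # ending at the current position; a match may start anywhere, so col[0] = 0.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(Q, start=1):
        prev_diag = col[0]  # value for (i-1, j-1)
        col[0] = 0
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,                    # skip c in Q
                col[i - 1] + 1,                # skip P[i-1]
                prev_diag + (P[i - 1] != c),   # match or substitute
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_match_ends("abc", "xabcx", 0))  # [4]
print(approx_match_ends("abc", "xabcx", 1))  # [3, 4, 5]
```

    Only one column of the table is kept at a time, so the sketch uses O(m) space.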