436 research outputs found

    Parameterized searching with mismatches for run-length encoded strings

    Parameterized matching between two strings occurs when it is possible to reduce the first one to the second by a renaming of the alphabet symbols. We present an algorithm for searching for parameterized occurrences of a pattern in a text string when both are given in run-length encoded form. The proposed method extends to alphabets of arbitrary yet constant size the O(|rp|×|rt|) time bound previously achieved only for binary alphabets. Here rp and rt denote the number of runs in the run-length encodings of the pattern p and the text t, respectively. For general alphabets, the time bound of the present method depends polynomially on the alphabet size. This is better than applying convolution to the cleartext, but it leaves open the problem of designing an alphabet-independent O(|rp|×|rt|) time algorithm. © 2012 Elsevier B.V. All rights reserved.
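
    To make the two notions in this abstract concrete, the following is a minimal Python sketch of run-length encoding and of a naive parameterized-occurrence search on the cleartext. It is not the run-length algorithm of the paper: it runs in time proportional to the product of the string lengths, it looks for exact parameterized matches only (no mismatches), and the function names are illustrative assumptions.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as a list of (symbol, run_length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


def parameterized_match(a, b):
    """Return True if a bijective renaming of symbols turns a into b."""
    if len(a) != len(b):
        return False
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # Both directions must stay consistent for the renaming to be a bijection.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


def parameterized_occurrences(pattern, text):
    """Naive scan: start positions where pattern parameterized-matches text."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if parameterized_match(pattern, text[i:i + m])
    ]
```

    For example, `parameterized_occurrences("aab", "ccdccd")` reports positions 0 and 3, since renaming a to c and b to d maps the pattern onto both windows.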

    Simrank: Rapid and sensitive general-purpose k-mer search tool

    Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset. Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity.
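
    The general idea of a k-mer similarity search can be sketched in a few lines of Python. This is an illustration only and makes no claim about how Simrank itself is implemented: the scoring rule (fraction of the query's distinct k-mers found in a database string) and all function names are assumptions.

```python
def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database entries by the fraction of query k-mers they contain.

    database is a dict mapping a name to a sequence string.
    Ties are broken by name so the ordering is deterministic.
    """
    query_kmers = kmers(query, k)
    scored = []
    for name, sequence in database.items():
        shared = len(query_kmers & kmers(sequence, k))
        score = shared / len(query_kmers) if query_kmers else 0.0
        scored.append((name, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

    A production tool would typically index the database k-mers once instead of recomputing them per query, which this sketch omits for brevity.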

    Parameterized Strings: Algorithms and Applications

    The parameterized string (p-string), a generalization of the traditional string, is composed of constant and parameter symbols. A parameterized match (p-match) exists between two p-strings if the constants match exactly and there exists a bijection between the parameter symbols. Historically, p-strings have been employed in source code cloning, plagiarism detection, and structural similarity between biological sequences. By handling the intricacies of the parameterized suffix, we can efficiently address complex applications with data structures also reusable in traditional matching scenarios. In this dissertation, we extend data structures for p-strings (and variants) to address sophisticated string computations. We introduce a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the parameterized longest previous factor (pLPF) and familiar data structures in string theory, including the border array, prefix array, longest common prefix array, and analogous p-string data structures. Exploiting this connection, we construct a multitude of data structures using the same general pLPF framework. Before this dissertation, the p-match was defined predominantly by the matching between uncompressed p-strings. Here, we introduce the compressed parameterized pattern match to find all p-matches between a pattern and a text, using only the pattern and a compressed form of the text. We present parameterized compression (p-compression) as a new way to losslessly compress data to support p-matching. Experimentally, it is shown that p-compression is competitive with standard compression schemes. Using p-compression, we address the compressed p-match independent of the underlying compression routine. Currently, p-string theory lacks the capability to support indeterminate symbols, an essential feature for applications involving inexact matching such as music analysis. In this work, we propose and efficiently address two new types of p-matching with indeterminate symbols. (1) We introduce the indeterminate parameterized match (ip-match) to permit matching with indeterminate holes in a p-string. We support the ip-match by introducing data structures that extend the prefix array. (2) From a different perspective, the equivalence parameterized match (e-match) evolves the p-match to consider intra-alphabet symbol classes as equivalence classes. We propose a method to perform the e-match using the p-string suffix array framework, i.e. the parameterized suffix array (pSA) and parameterized longest common prefix array (pLCP). Historically, direct constructions of the pSA and pLCP have suffered from quadratic time bounds in the worst case. Here, we introduce new p-string theory to efficiently construct the pSA/pLCP and break the theoretical worst-case time barrier. Biological applications have become a classical use of p-string theory. Here, we introduce the structural border array to provide a lightweight solution to the biologically-oriented variant of the p-match, i.e. the structural match (s-match) on structural strings (s-strings). Following the s-match, we show how to use s-string suffix structures to support various pattern matching problems involving RNA secondary structures. Finally, we propose and construct the forward stem matrix (FSM), a data structure to access RNA stem structures, and we apply the FSM to the detection of hairpins and pseudoknots in an RNA sequence. This dissertation advances the state of the art in p-string theory by developing data structures for p-strings/s-strings and using p-string/s-string theory in new and old contexts to address various applications. Due to the flexibility of the p-string/s-string, the data structures and algorithms in this work are also applicable to the myriad of problems in the string community that involve traditional strings.
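
    The p-match definition above (constants match exactly, parameters match up to a bijection) can be checked with the classical "prev" encoding from the p-string literature, which is not named in this abstract and is used here only as an illustration. Each parameter is replaced by the distance to its previous occurrence, or 0 at its first occurrence, and two p-strings of equal length p-match exactly when their encodings are equal. The function names and the way the parameter alphabet is passed are assumptions of this sketch.

```python
def prev_encode(s, parameters):
    """Encode a p-string: constants are kept, each parameter becomes the
    distance to its previous occurrence (0 for a first occurrence).

    s is a string; parameters is the set of symbols treated as parameters.
    """
    last_seen = {}
    encoded = []
    for i, symbol in enumerate(s):
        if symbol in parameters:
            encoded.append(i - last_seen[symbol] if symbol in last_seen else 0)
            last_seen[symbol] = i
        else:
            encoded.append(symbol)  # constant symbols must match exactly
    return encoded


def p_match(a, b, parameters):
    """Return True if a and b are a parameterized match."""
    return len(a) == len(b) and prev_encode(a, parameters) == prev_encode(b, parameters)
```

    With parameters `{"x", "y"}` and constant `a`, the strings `xayax` and `yaxay` p-match (swap x and y), while `xax` and `xay` do not.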

    Compressibility-Aware Quantum Algorithms on Strings

    Sublinear time quantum algorithms have been established for many fundamental problems on strings. This work demonstrates that new, faster quantum algorithms can be designed when the string is highly compressible. We focus on two popular and theoretically significant compression algorithms, the Lempel-Ziv77 algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT), and obtain the results below. We first provide a quantum algorithm running in $\tilde{O}(\sqrt{zn})$ time for finding the LZ77 factorization of an input string $T[1..n]$ with $z$ factors. Combined with multiple existing results, this yields an $\tilde{O}(\sqrt{rn})$ time quantum algorithm for finding the RL-BWT encoding with $r$ BWT runs. Note that $r = \tilde{\Theta}(z)$. We complement these results with lower bounds proving that our algorithms are optimal (up to polylog factors). Next, we study the problem of compressed indexing, where we provide an $\tilde{O}(\sqrt{rn})$ time quantum algorithm for constructing a recently designed $\tilde{O}(r)$ space structure with equivalent capabilities as the suffix tree. This data structure is then applied to numerous problems to obtain sublinear time quantum algorithms when the input is highly compressible. For example, we show that the longest common substring of two strings of total length $n$ can be computed in $\tilde{O}(\sqrt{zn})$ time, where $z$ is the number of factors in the LZ77 factorization of their concatenation. This beats the best known $\tilde{O}(n^{2/3})$ time quantum algorithm when $z$ is sufficiently small.
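
    The two compressibility measures used above, the number of LZ77 factors and the number of BWT runs, can be computed classically with the naive Python sketch below. It is purely illustrative and unrelated to the quantum algorithms of the paper. Definitions of LZ77 vary slightly across the literature; this sketch assumes the greedy variant in which each factor is either a single fresh symbol or the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed). The `$` sentinel and the function names are also assumptions.

```python
def lz77_factorize(t):
    """Greedy LZ77 factorization (quadratic-time or worse, for illustration)."""
    factors = []
    i, n = 0, len(t)
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and t[j + length] == t[i + length]:
                length += 1
            longest = max(longest, length)
        step = longest if longest > 0 else 1  # a fresh symbol is its own factor
        factors.append(t[i:i + step])
        i += step
    return factors


def bwt(t, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations of t + sentinel.

    Assumes the sentinel does not occur in t and sorts before all its symbols.
    """
    s = t + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i, symbol in enumerate(s) if i == 0 or symbol != s[i - 1])
```

    For instance, `lz77_factorize("abababab")` yields three factors, and `count_runs(bwt("banana"))` gives the number of BWT runs of `banana`.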

    Counting patterns in strings and graphs

    We study problems related to finding and counting patterns in strings and graphs. In the string regime, we are interested in counting how many substrings of a text T are at Hamming (or edit) distance at most k to a pattern P. Among others, we are interested in the fully-compressed setting, where both T and P are given in a compressed representation. For both distance measures, we give the first algorithm that runs in (almost) linear time in the size of the compressed representations. We obtain the algorithms by new and tight structural insights into the solution structure of the problems. In the graph regime, we study problems related to counting homomorphisms between graphs. In particular, we study the parameterized complexity of the problem #IndSub(Φ), where we are to count all k-vertex induced subgraphs of a graph that satisfy the property Φ. Based on a theory of Lovász, Curticapean et al., we express #IndSub(Φ) as a linear combination of graph homomorphism numbers to obtain #W[1]-hardness and almost tight conditional lower bounds for properties Φ that are monotone or that depend only on the number of edges of a graph. Thereby, we prove a conjecture by Jerrum and Meeks. In addition, we investigate the parameterized complexity of the problem #Hom(ℋ → 𝒢) for graph classes ℋ and 𝒢. In particular, we show that for any problem P in the class #W[1], there are classes ℋ_P and 𝒢_P such that P is equivalent to #Hom(ℋ_P → 𝒢_P).
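
    The string counting problem in this abstract can be stated precisely with a brute-force Python sketch. It works on uncompressed input and takes time proportional to the product of the two lengths, so it only illustrates the problem definition for Hamming distance, not the near-linear compressed algorithms of the thesis. The names `text`, `pattern`, `k`, and the function name are illustrative.

```python
def count_hamming_occurrences(text, pattern, k):
    """Count substrings of text with len(pattern) symbols that are within
    Hamming distance k of pattern."""
    m = len(pattern)
    count = 0
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the threshold
        if mismatches <= k:
            count += 1
    return count
```

    With `text = "abcabd"` and `pattern = "abc"`, only the exact occurrence counts for `k = 0`, while `k = 1` also admits the window `abd`.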

    Enhancing computer-aided plagiarism detection
