
436 research outputs found

Parameterized searching with mismatches for run-length encoded strings

Author: Ahuja
Alberto Apostolico
Alpár Jüttner
Amir
Apostolico
Baker
Fredman
Hazay
Péter L. Erdős
Radzik
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

Parameterized matching between two strings occurs when it is possible to reduce the first one to the second by a renaming of the alphabet symbols. We present an algorithm for searching for parameterized occurrences of a pattern in a text string when both are given in run-length encoded form. The proposed method extends to alphabets of arbitrary yet constant size with O(|rp| × |rt|) time bounds, previously achieved only with binary alphabets. Here rp and rt denote the number of runs in the corresponding encodings for p and t. For general alphabets, the time bound obtained by the present method exhibits a polynomial dependency on the alphabet size. Such a performance is better than applying convolution to the cleartext, but still leaves open the problem of designing an alphabet-independent O(|rp| × |rt|) time algorithm. © 2012 Elsevier B.V. All rights reserved.
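
To make the definition in the abstract above concrete, here is a minimal Python sketch of run-length encoding and of an exact parameterized match test. It is a naive, uncompressed illustration and not the paper's O(|rp| × |rt|) algorithm; it also ignores mismatches. The function names (`run_length_encode`, `parameterized_match`, `parameterized_occurrences`) are hypothetical and chosen only for this example.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the run-length encoding of s as (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def parameterized_match(a, b):
    """Return True if a one-to-one renaming of symbols turns a into b."""
    if len(a) != len(b):
        return False
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # Both maps must stay consistent so the renaming is a bijection.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


def parameterized_occurrences(pattern, text):
    """Naively list every start position where pattern parameterized-matches text."""
    m = len(pattern)
    return [
        i
        for i in range(len(text) - m + 1)
        if parameterized_match(pattern, text[i:i + m])
    ]
```

For instance, `parameterized_match("abab", "cdcd")` holds under the renaming a→c, b→d, while `parameterized_match("ab", "cc")` fails because the renaming would not be one-to-one.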

Crossref

Repository of the Academy's Library

Parameterized searching with mismatches for run-length encoded strings (extended abstract)

Author: A. Amir
A. Apostolico
B.S. Baker
M.L. Fredman
R.K. Ahuja
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Crossref

Repository of the Academy's Library

Simrank: Rapid and sensitive general-purpose k-mer search tool

Author: Alekseyenko Alexander V
Andersen Gary L
Brodie Eoin L
DeSantis Todd Z
Karaoz Ulas
Keller Keith
Larsen Niels
Pei Zhiheng
Singh Navjeet NS
Publication venue: BioMed Central
Publication date: 01/01/2011

Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset. Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity.
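
The general idea of k-mer based similarity search mentioned in the abstract above can be sketched in a few lines of Python. This is an illustrative assumption, not Simrank's actual implementation or scoring formula: similarity is taken here to be the fraction of the query's distinct k-mers that also occur in a database string, and all names (`kmers`, `kmer_similarity`, `rank_by_similarity`) are hypothetical.

```python
def kmers(s, k):
    """Return the set of distinct substrings of length k in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def kmer_similarity(query, target, k):
    """Fraction of the query's distinct k-mers that also occur in target."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmers(target, k)) / len(query_kmers)


def rank_by_similarity(query, database, k):
    """Return (score, name) pairs for a name->string mapping, best score first."""
    scored = [(kmer_similarity(query, seq, k), name) for name, seq in database.items()]
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

A real tool would typically precompute and index the database k-mers rather than rebuilding the sets for every query, which is where most of the speed would come from.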

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

UNT Digital Library

Parameterized Strings: Algorithms and Applications

Author: Beal Richard
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2015

The parameterized string (p-string), a generalization of the traditional string, is composed of constant and parameter symbols. A parameterized match (p-match) exists between two p-strings if the constants match exactly and there exists a bijection between the parameter symbols. Historically, p-strings have been employed in source code cloning, plagiarism detection, and structural similarity between biological sequences. By handling the intricacies of the parameterized suffix, we can efficiently address complex applications with data structures also reusable in traditional matching scenarios. In this dissertation, we extend data structures for p-strings (and variants) to address sophisticated string computations. We introduce a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the parameterized longest previous factor (pLPF) and familiar data structures in string theory, including the border array, prefix array, longest common prefix array, and analogous p-string data structures. Exploiting this connection, we construct a multitude of data structures using the same general pLPF framework. Before this dissertation, the p-match was defined predominately by the matching between uncompressed p-strings. Here, we introduce the compressed parameterized pattern match to find all p-matches between a pattern and a text, using only the pattern and a compressed form of the text. We present parameterized compression (p-compression) as a new way to losslessly compress data to support p-matching. Experimentally, it is shown that p-compression is competitive with standard compression schemes. Using p-compression, we address the compressed p-match independent of the underlying compression routine. Currently, p-string theory lacks the capability to support indeterminate symbols, a staple essential for applications involving inexact matching such as in music analysis. In this work, we propose and efficiently address two new types of p-matching with indeterminate symbols. (1) We introduce the indeterminate parameterized match (ip-match) to permit matching with indeterminate holes in a p-string. We support the ip-match by introducing data structures that extend the prefix array. (2) From a different perspective, the equivalence parameterized match (e-match) evolves the p-match to consider intra-alphabet symbol classes as equivalence classes. We propose a method to perform the e-match using the p-string suffix array framework, i.e. the parameterized suffix array (pSA) and parameterized longest common prefix array (pLCP). Historically, direct constructions of the pSA and pLCP have suffered from quadratic time bounds in the worst-case. Here, we introduce new p-string theory to efficiently construct the pSA/pLCP and break the theoretical worst-case time barrier. Biological applications have become a classical use of p-string theory. Here, we introduce the structural border array to provide a lightweight solution to the biologically-oriented variant of the p-match, i.e. the structural match (s-match) on structural strings (s-strings). Following the s-match, we show how to use s-string suffix structures to support various pattern matching problems involving RNA secondary structures. Finally, we propose/construct the forward stem matrix (FSM), a data structure to access RNA stem structures, and we apply the FSM to the detection of hairpins and pseudoknots in an RNA sequence. This dissertation advances the state-of-the-art in p-string theory by developing data structures for p-strings/s-strings and using p-string/s-string theory in new and old contexts to address various applications. Due to the flexibility of the p-string/s-string, the data structures and algorithms in this work are also applicable to the myriad of problems in the string community that involve traditional strings.
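
The p-match definition in the abstract above (constants match exactly, parameters match up to a bijection) can be checked with the standard "prev" encoding, in which each parameter symbol is replaced by the distance to its previous occurrence, or 0 at its first occurrence. The Python sketch below is a minimal illustration under that assumption and is not taken from the dissertation; the names `prev_encode` and `p_match` and the way the parameter alphabet is passed in are hypothetical.

```python
def prev_encode(s, parameters):
    """Encode a p-string: constants are kept, each parameter symbol becomes the
    distance to its previous occurrence (0 if it has not occurred before)."""
    last_seen = {}
    encoded = []
    for i, symbol in enumerate(s):
        if symbol in parameters:
            encoded.append(i - last_seen[symbol] if symbol in last_seen else 0)
            last_seen[symbol] = i
        else:
            encoded.append(symbol)
    return encoded


def p_match(a, b, parameters):
    """Two p-strings p-match exactly when their prev encodings are identical."""
    return prev_encode(a, parameters) == prev_encode(b, parameters)
```

Here constants stay as characters and parameter positions become integers, so the two kinds of entries cannot be confused when the encodings are compared.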

The Research Repository @ WVU (West Virginia University)

Compressibility-Aware Quantum Algorithms on Strings

Author: Gibney Daniel
Thankachan Sharma V.
Publication date: 14/02/2023

Sublinear time quantum algorithms have been established for many fundamental problems on strings. This work demonstrates that new, faster quantum algorithms can be designed when the string is highly compressible. We focus on two popular and theoretically significant compression algorithms -- the Lempel-Ziv77 algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT), and obtain the results below. We first provide a quantum algorithm running in $\tilde{O}(\sqrt{zn})$ time for finding the LZ77 factorization of an input string $T[1..n]$ with $z$ factors. Combined with multiple existing results, this yields an $\tilde{O}(\sqrt{rn})$ time quantum algorithm for finding the RL-BWT encoding with $r$ BWT runs. Note that $r = \tilde{\Theta}(z)$. We complement these results with lower bounds proving that our algorithms are optimal (up to polylog factors). Next, we study the problem of compressed indexing, where we provide a $\tilde{O}(\sqrt{rn})$ time quantum algorithm for constructing a recently designed $\tilde{O}(r)$ space structure with equivalent capabilities as the suffix tree. This data structure is then applied to numerous problems to obtain sublinear time quantum algorithms when the input is highly compressible. For example, we show that the longest common substring of two strings of total length $n$ can be computed in $\tilde{O}(\sqrt{zn})$ time, where $z$ is the number of factors in the LZ77 factorization of their concatenation. This beats the best known $\tilde{O}(n^\frac{2}{3})$ time quantum algorithm when $z$ is sufficiently small.
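
To illustrate what the number of LZ77 factors in the abstract above measures, here is a deliberately naive classical Python sketch of a greedy LZ77-style factorization. It is not the quantum algorithm from the paper. It assumes the self-referential variant in which each factor is either a single new character or the longest prefix of the remaining text that also starts at an earlier position (copies may overlap the current position); other LZ77 variants differ in detail. The name `lz77_factorize` is hypothetical.

```python
def lz77_factorize(text):
    """Greedy LZ77-style factorization (naive, roughly cubic time in the worst case).

    Each factor is the longest prefix of the unprocessed suffix that also starts
    at some earlier position, or a single character if no such prefix exists.
    """
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best = 0
        for j in range(i):  # candidate earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # a fresh character forms a factor of length 1
        factors.append(text[i:i + step])
        i += step
    return factors
```

Highly repetitive inputs produce few factors: `lz77_factorize("abababab")` yields three factors for eight characters, which is the regime in which the compressibility-aware bounds improve on bounds that depend on the text length alone.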

arXiv.org e-Print Archive

Counting patterns in strings and graphs

Author: Wellnitz Philip
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/2021

We study problems related to finding and counting patterns in strings and graphs. In the string-regime, we are interested in counting how many substrings of a text T are at Hamming (or edit) distance at most k to a pattern P. Among others, we are interested in the fully-compressed setting, where both T and P are given in a compressed representation. For both distance measures, we give the first algorithm that runs in (almost) linear time in the size of the compressed representations. We obtain the algorithms by new and tight structural insights into the solution structure of the problems. In the graph-regime, we study problems related to counting homomorphisms between graphs. In particular, we study the parameterized complexity of the problem #IndSub(Φ), where we are to count all k-vertex induced subgraphs of a graph that satisfy the property Φ. Based on a theory of Lovász, Curticapean et al., we express #IndSub(Φ) as a linear combination of graph homomorphism numbers to obtain #W[1]-hardness and almost tight conditional lower bounds for properties Φ that are monotone or that depend only on the number of edges of a graph. Thereby, we prove a conjecture by Jerrum and Meeks. In addition, we investigate the parameterized complexity of the problem #Hom(ℋ → 𝒢) for graph classes ℋ and 𝒢. In particular, we show that for any problem P in the class #W[1], there are classes ℋ_P and 𝒢_P such that P is equivalent to #Hom(ℋ_P → 𝒢_P).
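
The string-counting problem described in the abstract above can be stated very directly on uncompressed input. The Python sketch below is a naive baseline that compares the pattern against every window of the text; it is not the thesis's near-linear compressed-setting algorithm, and it covers only the Hamming-distance case. The function names and the threshold parameter `k` are chosen here for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def count_approximate_occurrences(pattern, text, k):
    """Count windows of text whose Hamming distance to pattern is at most k."""
    m = len(pattern)
    return sum(
        hamming_distance(pattern, text[i:i + m]) <= k
        for i in range(len(text) - m + 1)
    )
```

This baseline takes time proportional to the product of the pattern and text lengths, which is exactly the cost that algorithms working on compressed representations aim to avoid.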

Universaar


MPG.PuRe

Enhancing computer-aided plagiarism detection

Author: Mozgovoy Maxim
Publication venue: University of Joensuu

UEF Electronic Publications