11 research outputs found

    Inferring an Indeterminate String from a Prefix Graph

    Get PDF
    An \itbf{indeterminate string} (or, more simply, just a \itbf{string}) \s{x} = \s{x}[1..n] on an alphabet Σ\Sigma is a sequence of nonempty subsets of Σ\Sigma. We say that \s{x}[i_1] and \s{x}[i_2] \itbf{match} (written \s{x}[i_1] \match \s{x}[i_2]) if and only if \s{x}[i_1] \cap \s{x}[i_2] \ne \emptyset. A \itbf{feasible array} is an array \s{y} = \s{y}[1..n] of integers such that \s{y}[1] = n and for every i∈2..ni \in 2..n, \s{y}[i] \in 0..n\- i\+ 1. A \itbf{prefix table} of a string \s{x} is an array \s{\pi} = \s{\pi}[1..n] of integers such that, for every i∈1..ni \in 1..n, \s{\pi}[i] = j if and only if \s{x}[i..i\+ j\- 1] is the longest substring at position ii of \s{x} that matches a prefix of \s{x}. It is known from \cite{CRSW13} that every feasible array is a prefix table of some indetermintate string. A \itbf{prefix graph} \mathcal{P} = \mathcal{P}_{\s{y}} is a labelled simple graph whose structure is determined by a feasible array \s{y}. In this paper we show, given a feasible array \s{y}, how to use \mathcal{P}_{\s{y}} to construct a lexicographically least indeterminate string on a minimum alphabet whose prefix table \s{\pi} = \s{y}.Comment: 13 pages, 1 figur

    Computing Covers Using Prefix Tables

    Get PDF
    An \emph{indeterminate string} x=x[1..n]x = x[1..n] on an alphabet Σ\Sigma is a sequence of nonempty subsets of Σ\Sigma; xx is said to be \emph{regular} if every subset is of size one. A proper substring uu of regular xx is said to be a \emph{cover} of xx iff for every i∈1..ni \in 1..n, an occurrence of uu in xx includes x[i]x[i]. The \emph{cover array} γ=γ[1..n]\gamma = \gamma[1..n] of xx is an integer array such that γ[i]\gamma[i] is the longest cover of x[1..i]x[1..i]. Fifteen years ago a complex, though nevertheless linear-time, algorithm was proposed to compute the cover array of regular xx based on prior computation of the border array of xx. In this paper we first describe a linear-time algorithm to compute the cover array of regular string xx based on the prefix table of xx. We then extend this result to indeterminate strings.Comment: 14 pages, 1 figur

    Truly Subquadratic-Time Extension Queries and Periodicity Detection in Strings with Uncertainties

    Get PDF
    Strings with don\u27t care symbols, also called partial words, and more general indeterminate strings are a natural representation of strings containing uncertain symbols. A considerable effort has been made to obtain efficient algorithms for pattern matching and periodicity detection in such strings. Among those, a number of algorithms have been proposed that behave well on random data, but still their worst-case running time is Theta(n^2). We present the first truly subquadratic-time solutions for a number of such problems on partial words that can also be adapted to indeterminate strings over a constant-sized alphabet. We show that nn longest common compatible prefix queries (which correspond to longest common extension queries in regular strings) can be answered on-line in O(n * sqrt(n * log(n)) time after O(n * sqrt(n * log(n))-time preprocessing. We also present O(n * sqrt(n * log(n))-time algorithms for computing the prefix array and two types of border array of a partial word

    New Bounds and Extended Relations Between Prefix Arrays, Border Arrays, Undirected Graphs, and Indeterminate Strings

    Get PDF
    We extend earlier works on the relation of prefix arrays of indeterminate strings to undirected graphs and border arrays. If integer array y is the prefix array for indeterminate string w, then we say w satisfies y. We use a graph theoretic approach to construct a string on a minimum alphabet size which satisfies a given prefix array. We relate the problem of finding a minimum alphabet size to finding edge clique covers of a particular graph, allowing us to bound the minimum alphabet size by n plus square root of n for indeterminate strings, where n is the size of the prefix array. When we restrict ourselves to prefix arrays for partial words, we bound the minimum alphabet size by sqrt(2n). Moreover, we show that this bound is tight up to a constant multiple by using Sidon sets. We also study the relationship between prefix arrays and border arrays. We show that the slowly-increasing property completely characterizes border arrays for indeterminate strings, whence there are exactly C_n distinct border arrays of size nn for indeterminate strings (here C_n is the nth Catalan number). We also bound the number of prefix arrays for partial words of a given size using Stirling numbers of the second kind

    Enhanced covers of regular & indeterminate strings using prefix tables

    Get PDF
    A \itbf{cover} of a string x=x[1..n] is a proper substring u of x such that x can be constructed from possibly overlapping instances of u. A recent paper \cite{FIKPPST13} relaxes this definition --- an \itbf{enhanced cover} u of x is a border of x (that is, a proper prefix that is also a suffix) that covers a {\it maximum} number of positions in x (not necessarily all) --- and proposes efficient algorithms for the computation of enhanced covers. These algorithms depend on the prior computation of the \itbf{border array} β[1..n], where β[i] is the length of the longest border of x[1..i], 1≤i≤n. In this paper, we first show how to compute enhanced covers using instead the \itbf{prefix table}: an array π[1..n] such that π[i] is the length of the longest substring of x beginning at position i that matches a prefix of x. Unlike the border array, the prefix table is robust: its properties hold also for \itbf{indeterminate strings} --- that is, strings defined on {\it subsets} of the alphabet Σ rather than individual elements of Σ. Thus, our algorithms, in addition to being faster in practice and more space-efficient than those of \cite{FIKPPST13}, allow us to easily extend the computation of enhanced covers to indeterminate strings. Both for regular and indeterminate strings, our algorithms execute in expected linear time. Along the way we establish an important theoretical result: that the expected maximum length of any border of any prefix of a regular string x is approximately 1.64 for binary alphabets, less for larger one

    String Covering: A Survey

    Full text link
    The study of strings is an important combinatorial field that precedes the digital computer. Strings can be very long, trillions of letters, so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string x, initially proposed in a simple form in 1990, but increasingly of interest as more sophisticated variants have been discovered. We then consider covering by a seed; that is, a cover of a superstring of x. We conclude with many proposals for research directions that could make significant contributions to string processing in future

    Parameterized Strings: Algorithms and Applications

    Get PDF
    The parameterized string (p-string), a generalization of the traditional string, is composed of constant and parameter symbols. A parameterized match (p-match) exists between two p-strings if the constants match exactly and there exists a bijection between the parameter symbols. Historically, p-strings have been employed in source code cloning, plagiarism detection, and structural similarity between biological sequences. By handling the intricacies of the parameterized suffix, we can efficiently address complex applications with data structures also reusable in traditional matching scenarios. In this dissertation, we extend data structures for p-strings (and variants) to address sophisticated string computations.;We introduce a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the parameterized longest previous factor (pLPF) and familiar data structures in string theory, including the border array, prefix array, longest common prefix array, and analogous p-string data structures. Exploiting this connection, we construct a multitude of data structures using the same general pLPF framework.;Before this dissertation, the p-match was defined predominately by the matching between uncompressed p-strings. Here, we introduce the compressed parameterized pattern match to find all p-matches between a pattern and a text, using only the pattern and a compressed form of the text. We present parameterized compression (p-compression) as a new way to losslessly compress data to support p-matching. Experimentally, it is shown that p-compression is competitive with standard compression schemes. Using p-compression, we address the compressed p-match independent of the underlying compression routine.;Currently, p-string theory lacks the capability to support indeterminate symbols, a staple essential for applications involving inexact matching such as in music analysis. In this work, we propose and efficiently address two new types of p-matching with indeterminate symbols. (1) We introduce the indeterminate parameterized match (ip-match) to permit matching with indeterminate holes in a p-string. We support the ip-match by introducing data structures that extend the prefix array. (2) From a different perspective, the equivalence parameterized match (e-match) evolves the p-match to consider intra-alphabet symbol classes as equivalence classes. We propose a method to perform the e-match using the p-string suffix array framework, i.e. the parameterized suffix array (pSA) and parameterized longest common prefix array (pLCP). Historically, direct constructions of the pSA and pLCP have suffered from quadratic time bounds in the worst-case. Here, we introduce new p-string theory to efficiently construct the pSA/pLCP and break the theoretical worst-case time barrier.;Biological applications have become a classical use of p-string theory. Here, we introduce the structural border array to provide a lightweight solution to the biologically-oriented variant of the p-match, i.e. the structural match (s-match) on structural strings (s-strings). Following the s-match, we show how to use s-string suffix structures to support various pattern matching problems involving RNA secondary structures. Finally, we propose/construct the forward stem matrix (FSM), a data structure to access RNA stem structures, and we apply the FSM to the detection of hairpins and pseudoknots in an RNA sequence.;This dissertation advances the state-of-the-art in p-string theory by developing data structures for p-strings/s-strings and using p-string/s-string theory in new and old contexts to address various applications. Due to the flexibility of the p-string/s-string, the data structures and algorithms in this work are also applicable to the myriad of problems in the string community that involve traditional strings

    New perspectives on the prefix array

    No full text
    corecore