Search CORE

18 research outputs found

Heaviest Induced Ancestors and Longest Common Substrings

Author: Gagie Travis
Gawrychowski Paweł
Nekrich Yakov
Publication venue
Publication date: 01/01/2013
Field of study

Suppose we have two trees on the same set of leaves, in which nodes are weighted such that children are heavier than their parents. We say a node from the first tree and a node from the second tree are induced together if they have a common leaf descendant. In this paper we describe data structures that efficiently support the following heaviest-induced-ancestor query: given a node from the first tree and a node from the second tree, find an induced pair of their ancestors with maximum combined weight. Our solutions are based on a geometric interpretation that enables us to find heaviest induced ancestors using range queries. We then show how to use these results to build an LZ-compressed index with which we can quickly find with high probability a longest substring common to the indexed string and a given pattern

arXiv.org e-Print Archive

CiteSeerX

MPG.PuRe

Upper and lower bounds for dynamic data structures on strings

Author: Clifford Raphael
Grønlund Allan
Larsen Kasper Green
Starikovskaya Tatiana
Publication venue
Publication date: 01/01/2018
Field of study

We consider a range of simply stated dynamic data structure problems on strings. An update changes one symbol in the input and a query asks us to compute some function of the pattern of length

m

and a substring of a longer text. We give both conditional and unconditional lower bounds for variants of exact matching with wildcards, inner product, and Hamming distance computation via a sequence of reductions. As an example, we show that there does not exist an

O(m^{1/2-\varepsilon})

time algorithm for a large range of these problems unless the online Boolean matrix-vector multiplication conjecture is false. We also provide nearly matching upper bounds for most of the problems we consider.Comment: Accepted at STACS'1

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Explore Bristol Research

Optimal Prefix and Suffix Queries on Texts

Author: Crochemore Maxime
Iliopoulos Costas S.
Rahman Mohammad Sohel
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008
Field of study

International audienceIn this paper, we study a restricted version of the position restricted pattern matching problem introduced and studied by Makinen and Navarro [V. Makinen, G. Navarro, Position-restricted substring searching, in: J.R. Correa, A. Hevia, M.A. Kiwi (Eds.), LATIN, in: Lecture Notes in Computer Science, vol. 3887, Springer, 2006, pp. 703-714]. In the problem handled in this paper, we are interested in those occurrences of the pattern that lies in a suffix or in a prefix of the given text. We achieve optimal query time for our problem against a data structure which is an extension of the classic suffix tree data structure. The time and space complexity of the data structure is dominated by that of the suffix tree. Notably, the (best) algorithm by Makinen and Navarro, if applied to our problem, gives sub-optimal query time and the corresponding data structure also requires more time and space

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Mind the Gap: Essentially Optimal Algorithms for Online Dictionary Matching with One Gap

Author: Amir Amihood
Kopelowitz Tsvi
Levy Avivit
Pettie Seth
Porat Ely
Shalom B. Riva
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th International Symposium on Algorithms and Computation (ISAAC 2016)
Publication date: 01/01/2016
Field of study

We examine the complexity of the online Dictionary Matching with One Gap Problem (DMOG) which is the following. Preprocess a dictionary D of d patterns, where each pattern contains a special gap symbol that can match any string, so that given a text that arrives online, a character at a time, we can report all of the patterns from D that are suffixes of the text that has arrived so far, before the next character arrives. In more general versions the gap symbols are associated with bounds determining the possible lengths of matching strings. Online DMOG captures the difficulty in a bottleneck procedure for cyber-security, as many digital signatures of viruses manifest themselves as patterns with a single gap. In this paper, we demonstrate that the difficulty in obtaining efficient solutions for the DMOG problem, even in the offline setting, can be traced back to the infamous 3SUM conjecture. We show a conditional lower bound of Omega(delta(G_D)+op) time per text character, where G_D is a bipartite graph that captures the structure of D, delta(G_D) is the degeneracy of this graph, and op is the output size. Moreover, we show a conditional lower bound in terms of the magnitude of gaps for the bounded case, thereby showing that some known offline upper bounds are essentially optimal. We also provide matching upper-bounds (up to sub-polynomial factors), in terms of the degeneracy, for the online DMOG problem. In particular, we introduce algorithms whose time cost depends linearly on delta(G_D). Our algorithms make use of graph orientations, together with some additional techniques. These algorithms are of practical interest since although delta(G_D) can be as large as sqrt(d), and even larger if G_D is a multi-graph, it is typically a very small constant in practice. Finally, when delta(G_D) is large we are able to obtain even more efficient solutions

Dagstuhl Research Online Publication Server

Succinct Online Dictionary Matching with Improved Worst-Case Guarantees

Author: Kopelowitz Tsvi
Porat Ely
Rozen Yaron
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)
Publication date: 01/01/2016
Field of study

In the online dictionary matching problem the goal is to preprocess a set of patterns D={P_1,...,P_d} over alphabet Sigma, so that given an online text (one character at a time) we report all of the occurrences of patterns that are a suffix of the current text before the following character arrives. We introduce a succinct Aho-Corasick like data structure for the online dictionary matching problem. Our solution uses a new succinct representation for multi-labeled trees, in which each node has a set of labels from a universe of size lambda. We consider lowest labeled ancestor (LLA) queries on multi-labeled trees, where given a node and a label we return the lowest proper ancestor of the node that has the queried label. In this paper we introduce a succinct representation of multi-labeled trees for lambda=omega(1) that support LLA queries in O(log(log(lambda))) time. Using this representation of multi-labeled trees, we introduce a succinct data structure for the online dictionary matching problem when sigma=omega(1). In this solution the worst case cost per character is O(log(log(sigma)) + occ) time, where occ is the size of the current output. Moreover, the amortized cost per character is O(1+occ) time

Dagstuhl Research Online Publication Server

Upper and Lower Bounds for Dynamic Data Structures on Strings

Author: Larsen Kasper Green
Starikovskaya Tatiana
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 35th Symposium on Theoretical Aspects of Computer Science (STACS 2018)
Publication date: 01/01/2018
Field of study

We consider a range of simply stated dynamic data structure problems on strings. An update changes one symbol in the input and a query asks us to compute some function of the pattern of length m and a substring of a longer text. We give both conditional and unconditional lower bounds for variants of exact matching with wildcards, inner product, and Hamming distance computation via a sequence of reductions. As an example, we show that there does not exist an O(m^{1/2-epsilon}) time algorithm for a large range of these problems unless the online Boolean matrix-vector multiplication conjecture is false. We also provide nearly matching upper bounds for most of the problems we consider

Dagstuhl Research Online Publication Server

Cache-oblivious index for approximate string matching

Author: Hon WK
Lam TW
Shah R
Tam SL
Vitter JS
Publication venue: 'Elsevier BV'
Publication date: 01/01/2011
Field of study

This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies O((nlog kn)B) disk pages and finds all k-error matches with O((|P|+occ)B+log knloglog Bn) I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require Ω (|P|+occ+poly(logn)) I/Os. The second index reduces the space to O((nlogn)B) disk pages, and the I/O complexity is O((|P|+occ)B+log k(k+1)nloglogn) . © 2011 Elsevier B.V. All rights reserved.postprin

Elsevier - Publisher Connector

HKU Scholars Hub

Elastic-Degenerate String Matching with 1 Error

Author: Bernardini Giulia
Gabory Estéban
Pissis Solon P.
Stougie Leen
Sweering Michelle
Zuba Wiktor
Publication venue
Publication date: 01/01/2022
Field of study

An elastic-degenerate string is a sequence of

n

finite sets of strings of total length

N

, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length

m

in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an

\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)

-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where

\omega

denotes the matrix multiplication exponent and the

\tilde{\mathcal{O}}(\cdot)

notation suppresses polylog factors. In the

k

-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most

k

errors.

k

-EDSM can be solved in

\mathcal{O}(k^2mG+kN)

time, under edit distance, or

\mathcal{O}(kmG+kN)

time, under Hamming distance, where

G

denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately,

G

is only bounded by

N

, and so even for

k=1

, the existing algorithms run in

\Omega(mN)

time in the worst case. In this paper we show that

1

-EDSM can be solved in

\mathcal{O}((nm^2 + N)\log m)

\mathcal{O}(nm^3 + N)

time under edit distance. For the decision version, we present a faster

\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)

-time algorithm. We also show that

1

-EDSM can be solved in

\mathcal{O}(nm^2 + N\log m)

time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from

1

-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the

k

-errata trees for indexing with errors [Cole et al., STOC 2004].Comment: This is an extended version of a paper accepted at LATIN 202

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

A computational study of off-target effects of RNA interference

Author: Adema Coen M.
Lane Terran
Qiu Shibin
Publication venue: Oxford University Press
Publication date: 01/01/2005
Field of study

RNA interference (RNAi) is an intracellular mechanism for post-transcriptional gene silencing that is frequently used to study gene function. RNAi is initiated by short interfering RNA (siRNA) of ∼21 nt in length, either generated from the double-stranded RNA (dsRNA) by using the enzyme Dicer or introduced experimentally. Following association with an RNAi silencing complex, siRNA targets mRNA transcripts that have sequence identity for destruction. A phenotype resulting from this knockdown of expression may inform about the function of the targeted gene. However, ‘off-target effects’ compromise the specificity of RNAi if sequence identity between siRNA and random mRNA transcripts causes RNAi to knockdown expression of non-targeted genes. The complete off-target effects must be investigated systematically on each gene in a genome by adjusting a group of parameters, which is too expensive to conduct experimentally and motivates a study in silico. This computational study examined the potential for off-target effects of RNAi, employing the genome and transcriptome sequence data of Homo sapiens, Caenorhabditis elegans and Schizosaccharomyces pombe. The chance for RNAi off-target effects proved considerable, ranging from 5 to 80% for each of the organisms, when using as parameter the exact identity between any possible siRNA sequences (arbitrary length ranging from 17 to 28 nt) derived from a dsRNA (range 100–400 nt) representing the coding sequences of target genes and all other siRNAs within the genome. Remarkably, high-sequence specificity and low probability for off-target reactivity were optimally balanced for siRNA of 21 nt, the length observed mostly in vivo. The chance for off-target RNAi increased (although not always significantly) with greater length of the initial dsRNA sequence, inclusion into the analysis of available untranslated region sequences and allowing for mismatches between siRNA and target sequences. siRNA sequences from within 100 nt of the 5′ termini of coding sequences had low chances for off-target reactivity. This may be owing to coding constraints for signal peptide-encoding regions of genes relative to regions that encode for mature proteins. Off-target distribution varied along the chromosomes of C.elegans, apparently owing to the use of more unique sequences in gene-dense regions. Finally, biological and thermodynamical descriptors of effective siRNA reduced the number of potential siRNAs compared with those identified by sequence identity alone, but off-target RNAi remained likely, with an off-target error rate of ∼10%. These results also suggest a direction for future in vivo studies that could both help in calibrating true off-target rates in living organisms and also in contributing evidence toward the debate of whether siRNA efficacy is correlated with, or independent of, the target molecule. In summary, off-target effects present a real but not prohibitive concern that should be considered for RNAi experiments

CiteSeerX

Crossref

PubMed Central