Search CORE

1,117 research outputs found

Bounded prefix-suffix duplication

Author: A. Ehrenfeucht
D. Gusfield
D. Knuth
D.B. Searls
D.P. Bovet
J. Dassow
J. Kärkkäinen
M. Crochemore
M. Frazier
M.-W. Wang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

We consider a restricted variant of the prefix-suffix duplication operation, called bounded prefix-suffix duplication. It consists in the iterative duplication of a prefix or suffix, whose length is bounded by a constant, of a given word. We give a sufficient condition for the closure under bounded prefix-suffix duplication of a class of languages. Consequently, the class of regular languages is closed under bounded prefix-suffix duplication; furthermore, we propose an algorithm deciding whether a regular language is a finite k-prefix-suffix duplication language. An efficient algorithm solving the membership problem for the k-prefix-suffix duplication of a language is also presented. Finally, we define the k-prefix-suffix duplication distance between two words, extend it to languages and show how it can be computed for regular languages.
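
As a rough illustration of the operation described in this abstract, the sketch below applies one bounded duplication step and tests membership by exhaustive search. It assumes the usual reading of the definition (a prefix or suffix of length at most k is copied next to itself); the function names are hypothetical, and the search is a brute-force stand-in, not the efficient membership algorithm the paper presents.

```python
from collections import deque


def k_psd_step(word, k):
    """Words obtained from `word` by duplicating one prefix or suffix of length 1..k."""
    results = set()
    for length in range(1, min(k, len(word)) + 1):
        results.add(word[:length] + word)    # prefix x of w = xy gives xxy
        results.add(word + word[-length:])   # suffix x of w = yx gives yxx
    return results


def in_k_psd_language(start, target, k):
    """Brute-force check whether `target` is reachable from `start` (assumed model)."""
    seen = {start}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == target:
            return True
        for nxt in k_psd_step(word, k):
            # Every step lengthens the word, so longer words can be discarded.
            if len(nxt) <= len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

With k = 1 the word "ab" can only grow into words of the form a...ab...b, so "abab" is unreachable; raising the bound to k = 2 makes it reachable in one step.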

Crossref

Archivo Digital UPM

Ten Conferences WORDS: Open Problems and Conjectures

Author: Néraud Jean
Publication date: 10/06/2016

In connection with the development of the field of Combinatorics on Words, we present a list of open problems and conjectures that were stated during the last ten WORDS meetings. We wish to continually update the present document by adding information concerning advances in solving these problems.

arXiv.org e-Print Archive

HAL - Normandie Université

Detecting Breakage Fusion Bridge cycles in tumor genomes -- an algorithmic approach

Author: Bafna Vineet
Kinsella Marcus
Zakov Shay
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 11/01/2013

Breakage-Fusion-Bridge (BFB) is a mechanism of genomic instability characterized by the joining and subsequent tearing apart of sister chromatids. When this process is repeated during multiple rounds of cell division, it leads to patterns of copy number increases of chromosomal segments as well as fold-back inversions where duplicated segments are arranged head-to-head. These structural variations can then drive tumorigenesis. BFB can be observed in progress using cytogenetic techniques, but generally BFB must be inferred from data like microarrays or sequencing collected after BFB has ceased. Making correct inferences from these data is not straightforward, particularly given the complexity of some cancer genomes and BFB's ability to generate a wide range of rearrangement patterns. Here we present algorithms to aid the interpretation of evidence for BFB. We first pose the BFB count vector problem: given a chromosome segmentation and segment copy numbers, decide whether BFB can yield a chromosome with the given segment counts. We present the first linear-time algorithm for the problem, improving a previous exponential-time algorithm. We then combine this algorithm with fold-back inversions to develop tests for BFB. We show that, contingent on assumptions about cancer genome evolution, count vectors and fold-back inversions are sufficient evidence for detecting BFB. We apply the presented techniques to paired-end sequencing data from pancreatic tumors and confirm a previous finding of BFB as well as identify a new chromosomal region likely rearranged by BFB cycles, demonstrating the practicality of our approach.
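
The count vector problem stated in this abstract can be illustrated with a small exhaustive search. The sketch below rests on a simplifying assumption that is not taken from the abstract: a BFB cycle is modeled as appending a reversed copy of some non-empty suffix of the chromosome, and segment orientation is ignored. The function names are hypothetical, and the search is exponential in general, so it is not the linear-time algorithm the authors describe.

```python
from collections import deque


def bfb_cycle_results(chrom):
    """Strings obtained by appending a reversed copy of one non-empty suffix (assumed model)."""
    return {chrom + chrom[i:][::-1] for i in range(len(chrom))}


def admits_count_vector(segments, counts):
    """Brute-force test: can repeated cycles on `segments` yield exactly `counts`?

    `segments` is a string of distinct single-character segment labels and
    `counts` gives the required copy number of each label, in the same order.
    """
    target = tuple(counts)
    seen = {segments}
    queue = deque([segments])
    while queue:
        chrom = queue.popleft()
        current = tuple(chrom.count(s) for s in segments)
        if current == target:
            return True
        for nxt in bfb_cycle_results(chrom):
            nxt_counts = tuple(nxt.count(s) for s in segments)
            # Copy numbers never decrease, so overshooting any entry is a dead end.
            if nxt not in seen and all(c <= t for c, t in zip(nxt_counts, target)):
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Under this model, segments "ABC" admit the counts (1, 2, 2) via ABC to ABCCB, but not (2, 1, 1), because copying A requires folding back the whole chromosome and so copies B and C as well.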

arXiv.org e-Print Archive

CiteSeerX

Prefix-suffix duplication

Author: García López de Lacalle Jesús
Manea Florin
Mitrana Victor
Publication venue: 'Elsevier BV'
Publication date: 01/01/2014

We consider a bio-inspired formal operation on words called prefix-suffix duplication which consists in the duplication of a prefix or suffix of a given word. The class of languages defined by the iterated application of the prefix-suffix duplication to a word is considered. We show that such a language is context-free if and only if the initial word contains just one letter. Moreover, every language in this class is semilinear and belongs to NL. We propose an O(n^2 log n) time and O(n^2) space recognition algorithm. Two algorithms are further proposed for computing the prefix-suffix duplication distance between two words, defined as the minimal number of prefix-suffix duplications applied to one of them in order to get the other one. The first algorithm runs in cubic time and uses quadratic space while the second one is more efficient, having O(n^2 log n) time complexity, but needs O(n^2 log n) space.
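
The distance defined in this abstract can be made concrete with a breadth-first search over all words reachable by prefix or suffix duplications. This is only a minimal sketch: the function names are hypothetical, and the search is exponential in the worst case, unlike the cubic and O(n^2 log n) algorithms the paper proposes.

```python
from collections import deque


def psd_successors(word):
    """Words obtained by duplicating one non-empty prefix or suffix of `word`."""
    n = len(word)
    successors = set()
    for i in range(1, n + 1):
        successors.add(word[:i] + word)      # duplicate the prefix of length i
        successors.add(word + word[n - i:])  # duplicate the suffix of length i
    return successors


def psd_distance(source, target):
    """Minimal number of duplications turning `source` into `target`, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, steps = queue.popleft()
        for nxt in psd_successors(word):
            if nxt == target:
                return steps + 1
            # Words only grow, so anything at least as long as the target is useless.
            if len(nxt) < len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None
```

For instance, "a" reaches "aaaa" in two steps by doubling twice, while "ba" is not reachable from "ab" at all.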

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital UPM

The Tandem Duplication Distance Is NP-Hard

Author: Lafond Manuel
Zhu Binhai
Zou Peng
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)
Publication date: 12/06/2019

In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment - this can be represented as the string operation AXB → AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the smallest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year-old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.
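
The string operation AXB → AXXB and the resulting distance can be sketched directly. The code below is an illustrative brute-force search with hypothetical function names; it takes exponential time in general, which is in line with the NP-hardness result reported in the abstract, and it is not the fixed-parameter algorithm from the paper.

```python
from collections import deque


def tandem_duplications(s):
    """All strings AXXB obtained from s = AXB for a non-empty factor X."""
    results = set()
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # s[:j] is AX, s[i:j] is X, s[j:] is B.
            results.add(s[:j] + s[i:j] + s[j:])
    return results


def td_distance(source, target):
    """Fewest tandem duplications converting `source` into `target`, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, steps = queue.popleft()
        for nxt in tandem_duplications(current):
            if nxt == target:
                return steps + 1
            # Each duplication lengthens the string, so prune by length.
            if len(nxt) < len(target) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None
```

Unlike prefix-suffix duplication, the copied factor X may sit anywhere in the string, which is why the set of successors grows quadratically with the string length.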

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Languages Generated by Iterated Idempotencies.

Author: Leupold Klaus-Peter
Publication venue: 'Universitat Rovira I Virgili'
Publication date: 01/01/2006

The rewrite relation with parameters m and n and with the possible length limit = k or ≤ k we denote by ω_m^n, ^{=k}ω_m^n, or ^{≤k}ω_m^n, respectively. The idempotency languages generated from a starting word w by the respective operations are […]. Also other special cases of idempotency languages besides duplication have come up in different contexts. The investigations of Ito et al. about insertion and deletion, i.e., operations that are also observed in DNA molecules, have established that […] and […] both preserve regularity. Our investigations about idempotency relations and languages start out from the case of a uniform length bound. For these relations ^{=k}ω_m^n the conditions for confluence are characterized completely. Also the question of regularity is answered for all the languages […]; […] are more complicated and belong to the class of context-free languages. For a general length bound, i.e., for the relations ^{≤k}ω_m^n, confluence does not hold so frequently. This complicatedness of the relations results also in more complicated languages, which are often non-regular, as for example the languages […]. Without any length bound, idempotency relations have a very complicated structure. Over alphabets of one or two letters we still characterize the conditions for confluence. Over three or more letters, in contrast, only a few cases are solved. We determine the combinations of parameters that result in the regularity of […]. In a second chapter some more involved questions are solved for the special case of duplication. First we shed some light on the reasons why it is so difficult to determine the context-freeness of duplication languages. We show that they fulfill all pumping properties and that they are very dense. Therefore all the standard tools to prove non-context-freeness do not apply here. The concept of root in Formal Language Theory is frequently used to describe the reduction of a word to another one, which is in some sense elementary. For example, there are primitive roots, periodicity roots, etc. Elementary in connection with duplication are square-free words, i.e., words that do not contain any repetition. Thus we define the duplication root of w to consist of all the square-free words from which w can be reached via the relation […]. Besides some general observations we prove the decidability of the question whether the duplication root of a language is finite. Then we devise a code which is robust under duplication of its code words. This would keep the result of a computation from being destroyed by duplications in the code words. We determine the exact conditions under which infinite such codes exist: over an alphabet of two letters they exist for a length bound of 2, over three letters already for a length bound of 1. Also we apply duplication to entire languages rather than to single words; then it is interesting to determine whether regular and context-free languages are closed under this operation. We show that the regular languages are closed under uniformly bounded duplication, while they are not closed under duplication with a general length bound. The context-free languages are closed under both operations. The thesis concludes with a list of open problems related to the thesis' topics.
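
The duplication root mentioned in this abstract (the square-free words from which a word can be reached by duplications) can be computed for small inputs by running duplication backwards, that is, by repeatedly replacing a square uu with u. The sketch below assumes the unbounded duplication relation; the function names are hypothetical and the search is exhaustive.

```python
def undo_one_duplication(word):
    """All words obtained by replacing one square uu inside `word` with u."""
    results = set()
    n = len(word)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if word[i:i + half] == word[i + half:i + 2 * half]:
                results.add(word[:i + half] + word[i + 2 * half:])
    return results


def duplication_root(word):
    """Set of square-free words from which `word` is reachable by duplications."""
    roots = set()
    seen = {word}
    stack = [word]
    while stack:
        current = stack.pop()
        reduced = undo_one_duplication(current)
        if not reduced:
            roots.add(current)  # no square left, so `current` is square-free
        for shorter in reduced:
            if shorter not in seen:
                seen.add(shorter)
                stack.append(shorter)
    return roots
```

The root need not be a single word: for "abcbabcbc" the reduction can end in either "abc" or "abcbabc", depending on which square is removed first.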

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

Repositori Institucional URV

Low-redundancy codes for correcting multiple short-duplication and edit errors

Author: Farnoud Farzad
Gabrys Ryan
Lou Hao
Tang Yuanyuan
Wang Shuche
Publication date: 03/08/2022

Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most $p$ edits, where a short duplication generates a copy of a substring with length $\leq 3$ and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to $p$ edits (in addition to duplications) at the additional cost of roughly $8p(\log_q n)(1+o(1))$ symbols of redundancy, thus achieving the same asymptotic rate, where $q \ge 4$ is the alphabet size and $p$ is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when $p$ is a constant with respect to the code length. Comment: 21 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at the ISIT2021 and ISIT202
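
To make the error model and the redundancy figure in this abstract more tangible, the sketch below enumerates the results of one short duplication (copied substring of length at most 3) and evaluates the leading term 8p log_q n of the stated extra redundancy. The function names are hypothetical, the o(1) term is simply dropped, and none of this reflects the paper's actual code construction.

```python
import math


def short_duplications(s, max_len=3):
    """Strings obtained from `s` by one tandem duplication of a substring of length 1..max_len."""
    results = set()
    for i in range(len(s)):
        for length in range(1, max_len + 1):
            j = i + length
            if j > len(s):
                break
            # Insert a copy of s[i:j] immediately after the original substring.
            results.add(s[:j] + s[i:j] + s[j:])
    return results


def extra_redundancy_leading_term(p, q, n):
    """Leading term 8 * p * log_q(n) of the additional redundancy, ignoring the o(1) factor."""
    return 8 * p * math.log(n) / math.log(q)
```

For a quaternary alphabet (q = 4), code length n = 4^10 and p = 1 edit, the leading term is about 80 symbols.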

arXiv.org e-Print Archive