1,117 research outputs found

    Bounded prefix-suffix duplication

    Get PDF
    We consider a restricted variant of the prefix-suffix duplication operation, called bounded prefix-suffix duplication. It consists in the iterative duplication of a prefix or suffix, whose length is bounded by a constant, of a given word. We give a sufficient condition for the closure under bounded prefix-suffix duplication of a class of languages. Consequently, the class of regular languages is closed under bounded prefix-suffix duplication; furthermore, we propose an algorithm deciding whether a regular language is a finite k-prefix-suffix duplication language. An efficient algorithm solving the membership problem for the k-prefix-suffix duplication of a language is also presented. Finally, we define the k-prefix-suffix duplication distance between two words, extend it to languages and show how it can be computed for regular languages

    Ten Conferences WORDS: Open Problems and Conjectures

    Full text link
    In connection to the development of the field of Combinatorics on Words, we present a list of open problems and conjectures that were stated during the ten last meetings WORDS. We wish to continually update the present document by adding informations concerning advances in problems solving

    Detecting Breakage Fusion Bridge cycles in tumor genomes -- an algorithmic approach

    Full text link
    Breakage-Fusion-Bridge (BFB) is a mechanism of genomic instability characterized by the joining and subsequent tearing apart of sister chromatids. When this process is repeated during multiple rounds of cell division, it leads to patterns of copy number increases of chromosomal segments as well as fold-back inversions where duplicated segments are arranged head-to-head. These structural variations can then drive tumorigenesis. BFB can be observed in progress using cytogenetic techniques, but generally BFB must be inferred from data like microarrays or sequencing collected after BFB has ceased. Making correct inferences from this data is not straightforward, particularly given the complexity of some cancer genomes and BFB's ability to generate a wide range of rearrangement patterns. Here we present algorithms to aid the interpretation of evidence for BFB. We first pose the BFB count vector problem: given a chromosome segmentation and segment copy numbers, decide whether BFB can yield a chromosome with the given segment counts. We present the first linear-time algorithm for the problem, improving a previous exponential-time algorithm. We then combine this algorithm with fold-back inversions to develop tests for BFB. We show that, contingent on assumptions about cancer genome evolution, count vectors and fold-back inversions are sufficient evidence for detecting BFB. We apply the presented techniques to paired-end sequencing data from pancreatic tumors and confirm a previous finding of BFB as well as identify a new chromosomal region likely rearranged by BFB cycles, demonstrating the practicality of our approach

    Prefix-suffix duplication

    Full text link
    We consider a bio-inspired formal operation on words called prefix-suffix duplication which consists in the duplication of a prefix or suffix of a given word. The class of languages defined by the iterated application of the prefix-suffix duplication to a word is considered. We show that such a language is context-free if and only if the initial word contains just one letter. Moreover, every language in this class is semilinear and belongs to NL. We propose a 0(n2 logn) time and 0(n2 ) space recognition algorithm. Two algorithms are further proposed for computing the prefix-suffix duplication distance between two words, defined as the minimal number of prefix-suffix duplications applied to one of them in order to get the other one. The first algorithm runs in cubic time and uses quadratic space while the second one is more efficient, having 0(n2 logn) time complexity, but needs 0(n2 logn) space

    The Tandem Duplication Distance Is NP-Hard

    Get PDF
    In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment - this can be represented as the string operation AXB ? AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the smallest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems

    Languages Generated by Iterated Idempotencies.

    Get PDF
    The rewrite relation with parameters m and n and with the possible length limit = k or :::; k we denote by w~, =kW~· or ::;kw~ respectively. The idempotency languages generated from a starting word w by the respective operations are wDAlso other special cases of idempotency languages besides duplication have come up in different contexts. The investigations of Ito et al. about insertion and deletion, Le., operations that are also observed in DNA molecules, have established that w5 and w~ both preserve regularity.Our investigations about idempotency relations and languages start out from the case of a uniform length bound. For these relations =kW~ the conditions for confluence are characterized completely. Also the question of regularity is -k n answered for aH the languages w- D 1 are more complicated and belong to the class of context-free languages.For a generallength bound, i.e."for the relations :"::kW~, confluence does not hold so frequently. This complicatedness of the relations results also in more complicated languages, which are often non-regular, as for example the languages WWithout any length bound, idempotency relations have a very complicated structure. Over alphabets of one or two letters we still characterize the conditions for confluence. Over three or more letters, in contrast, only a few cases are solved. We determine the combinations of parameters that result in the regularity of wDIn a second chapter sorne more involved questions are solved for the special case of duplication. First we shed sorne light on the reasons why it is so difficult to determine the context-freeness ofduplication languages. We show that they fulfiH aH pumping properties and that they are very dense. Therefore aH the standard tools to prove non-context-freness do not apply here.The concept of root in Formal Language ·Theory is frequently used to describe the reduction of a word to another one, which is in sorne sense elementary.For example, there are primitive roots, periodicity roots, etc. Elementary in connection with duplication are square-free words, Le., words that do not contain any repetition. Thus we define the duplication root of w to consist of aH the square-free words, from which w can be reached via the relation w~.Besides sorne general observations we prove the decidability of the question, whether the duplication root of a language is finite.Then we devise acode, which is robust under duplication of its code words.This would keep the result of a computation from being destroyed by dupli cations in the code words. We determine the exact conditions, under which infinite such codes exist: over an alphabet of two letters they exist for a length bound of 2, over three letters already for a length bound of 1.Also we apply duplication to entire languages rather than to single words; then it is interesting to determine, whether regular and context-free languages are closed under this operation. We show that the regular languages are closed under uniformly bounded duplication, while they are not closed under duplication with a generallength bound. The context-free languages are closed under both operations.The thesis concludes with a list of open problems related with the thesis' topics

    Low-redundancy codes for correcting multiple short-duplication and edit errors

    Full text link
    Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most pp edits, where a short duplication generates a copy of a substring with length 3\leq 3 and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to pp edits (in addition to duplications) at the additional cost of roughly 8p(logqn)(1+o(1))8p(\log_q n)(1+o(1)) symbols of redundancy, thus achieving the same asymptotic rate, where q4q\ge 4 is the alphabet size and pp is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when pp is a constant with respect to the code length.Comment: 21 pages. The paper has been submitted to IEEE Transaction on Information Theory. Furthermore, the paper was presented in part at the ISIT2021 and ISIT202
    corecore